This presentation is Part 2 of my September Lisp NYC presentation on Reinforcement Learning and Artificial Neural Nets. We will continue from where we left off by covering Convolutional Neural Nets (CNNs) and Recurrent Neural Nets (RNNs) in depth. Time permitting, I also plan on having a few slides on each of the following topics: 1. Generative Adversarial Networks (GANs) 2. Differentiable Neural Computers (DNCs) 3. Deep Reinforcement Learning (DRL). Some code examples will be provided in Clojure.

After a very brief recap of Part 1 (ANN & RL), we will jump right into CNNs and their appropriateness for image recognition. We will start by covering the convolution operator, then explain feature maps and pooling operations, and then the LeNet-5 architecture. The MNIST data will be used to illustrate a fully functioning CNN. Next we cover Recurrent Neural Nets in depth and describe how they have been used in Natural Language Processing. We will explain why gated networks and LSTMs are used in practice.

Please note that some exposure to or familiarity with Gradient Descent and Backpropagation will be assumed. These are covered in the first part of the talk, for which both video and slides are available online. A lot of material will be drawn from the new Deep Learning book by Goodfellow & Bengio, Michael Nielsen's online book Neural Networks and Deep Learning, and several other online resources.

Bio: Pierre de Lacaze has over 20 years of industry experience with AI and Lisp-based technologies. He holds a Bachelor of Science in Applied Mathematics and a Master’s Degree in Computer Science. https://www.linkedin.com/in/pierre-de-lacaze-b11026b/


- 1. Deep Learning Pierre de Lacaze rpl@lispnyc.org Lisp NYC Tuesday, June 20th, 2017 Jane Street Capital
- 2. Overview Principal Topics 1. Convolutional Neural Networks (CNNs) 2. Recurrent Neural Networks (RNNs) Time permitting… 1. Generative Adversarial Networks (GANs) 2. Differentiable Neural Computers (DNCs) 3. Deep Reinforcement Learning (DRL)
- 3. Deep Neural Networks • A deep neural network is a neural network with multiple layers of hidden units. – E.g. Multi-Layered Perceptrons (MLPs) • Convolutional Neural Nets (CNNs) – Biologically-inspired variants of MLPs – Successfully used in image recognition and speech recognition • Recurrent Neural Nets (RNNs) – Cyclic graphs in which the outputs of later layers feed back into earlier layers – Allow for a window of time into past data – Successfully used for Natural Language Processing.
- 4. Application: Combining CNNs & RNNs GENERATING IMAGE DESCRIPTIONS Together with Convolutional Neural Networks, RNNs have been used as part of a model to generate descriptions for unlabeled images. It’s quite amazing how well this seems to work. The combined model even aligns the generated words with features found in the images. Deep Visual-Semantic Alignments for Generating Image Descriptions. Source: http://cs.stanford.edu/people/karpathy/deepimagesent
- 5. Part 0 ANN Review & Multi-Layered Perceptrons (MLPs) Multi Layered Perceptrons (MLPs) are fully connected feed forward networks with several layers of hidden units.
- 6. Linear Units and Perceptrons • Linear Unit: A linear combination of weighted inputs (real-valued) • Perceptron: Thresholded Linear Unit (discrete-valued) Note: w0 is a bias whose purpose is to move the threshold of the activation function.
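The two unit types on this slide can be sketched in a few lines of Clojure. This is a minimal illustration, not code from the talk: the function names `linear-unit` and `perceptron` and the choice of -1/1 as the two output values are assumptions.

```clojure
;; Illustrative sketch; names and the -1/1 output convention are assumptions.

(defn linear-unit
  "Weighted sum of the inputs plus the bias w0 (real-valued output)."
  [w0 weights inputs]
  (+ w0 (reduce + (map * weights inputs))))

(defn perceptron
  "Thresholded linear unit: 1 when the weighted sum is positive, else -1."
  [w0 weights inputs]
  (if (pos? (linear-unit w0 weights inputs)) 1 -1))

;; With w0 = -1.5 and both weights 1, the unit fires only when both inputs are 1.
(map #(perceptron -1.5 [1 1] %) [[0 0] [0 1] [1 0] [1 1]])
;; => (-1 -1 -1 1)
```

Changing only `w0` (for example to -0.5) moves the threshold so that the same weights compute logical OR instead, which is the role of the bias described in the note above.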
- 7. Multi Layered Perceptrons • These are fully connected Deep Feed Forward Networks • Every output from the previous layer is connected to every unit in the next layer • They are typically trained using the Backpropagation Algorithm • Backpropagation is effectively Gradient Descent applied to every unit in the network. Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 2.
- 8. Gradient Descent Motivation [Figure: error surface plotted over weight space]
- 9. ANN Backpropagation Algorithm (using incremental gradient descent) 1. Initialize the weights to small random numbers 2. Until the termination criterion is met, for each training example: a. Compute the network outputs for the training example b. For each output unit k, compute its error: $\delta_k = o_k (1 - o_k)(t_k - o_k)$ c. For each hidden unit h, compute its error: $\delta_h = o_h (1 - o_h) \sum_k w_{hk}\,\delta_k$ d. Update each network weight $w_{ij}$: $w_{ij} \leftarrow w_{ij} + \eta\,\delta_j\,x_{ij}$, where $\delta_j$ is the error of the unit j that the weight feeds into and $x_{ij}$ is the input from unit i to unit j
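Steps a to d can be sketched for a network with a single hidden layer of sigmoid units. This is an illustrative sketch rather than the code shown in the talk: the network representation (a map with `:hidden` and `:output`, each a vector of rows `[bias w1 w2 ...]`) and all function names are assumptions.

```clojure
;; Illustrative sketch; data layout and names are assumptions.

(defn sigmoid [z] (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn layer-out
  "Outputs of one layer. Each row of `weights` is [bias w1 w2 ...] for one unit."
  [weights inputs]
  (mapv (fn [row] (sigmoid (reduce + (map * row (cons 1.0 inputs))))) weights))

(defn update-layer
  "w_ij <- w_ij + eta * delta_j * x_ij for every unit j in the layer."
  [weights deltas inputs eta]
  (mapv (fn [row d]
          (mapv (fn [w x] (+ w (* eta d x))) row (cons 1.0 inputs)))
        weights deltas))

(defn backprop-step
  "One incremental update for a single training example [x t]."
  [{:keys [hidden output]} [x t] eta]
  (let [h  (layer-out hidden x)                               ; step a
        o  (layer-out output h)
        dk (mapv (fn [ok tk] (* ok (- 1 ok) (- tk ok))) o t)  ; step b
        dh (vec (map-indexed                                  ; step c
                  (fn [i oh]
                    (* oh (- 1 oh)
                       ;; (inc i) skips the bias entry of each output row
                       (reduce + (map (fn [row d] (* (nth row (inc i)) d))
                                      output dk))))
                  h))]
    {:hidden (update-layer hidden dh x eta)                   ; step d
     :output (update-layer output dk h eta)}))

(defn squared-error
  "Sum of squared output errors of `net` on one example [x t]."
  [net [x t]]
  (let [o (layer-out (:output net) (layer-out (:hidden net) x))]
    (reduce + (map (fn [ok tk] (let [e (- tk ok)] (* e e))) o t))))
```

Training then amounts to repeatedly folding `backprop-step` over the training examples, for instance with `(reduce #(backprop-step %1 %2 0.3) net examples)`, until `squared-error` is small enough.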
- 10. Thoughtful Reminder Slide: Show Code Examples
- 11. Identity Function Example • Tom Mitchell, Machine Learning, Chapter 4, 1st edition. (def if-td [[[1 0 0 0 0 0 0 0] [1 0 0 0 0 0 0 0]] [[0 1 0 0 0 0 0 0] [0 1 0 0 0 0 0 0]] [[0 0 1 0 0 0 0 0] [0 0 1 0 0 0 0 0]] [[0 0 0 1 0 0 0 0] [0 0 0 1 0 0 0 0]] [[0 0 0 0 1 0 0 0] [0 0 0 0 1 0 0 0]] [[0 0 0 0 0 1 0 0] [0 0 0 0 0 1 0 0]] [[0 0 0 0 0 0 1 0] [0 0 0 0 0 0 1 0]] [[0 0 0 0 0 0 0 1] [0 0 0 0 0 0 0 1]]]) • Ran 3 examples of MLPs on the identity function. – A 1 hidden layer MLP: 8 x 3 x 8 – A 2 hidden layer MLP: 8 x 3 x 3 x 8 – A 3 hidden layer MLP: 8 x 3 x 3 x 3 x 8
- 12. MLP Training Comparisons ❶ MLP with 1 hidden layer of 3 hidden units: 4,500 iterations to converge ❷ MLP with 2 hidden layers of 3 hidden units: 28,000 iterations to converge ❸ MLP with 3 hidden layers of 3 hidden units: 1,000,000+ iterations to converge
- 13. Part 1 Convolutional Neural Nets (CNNs) Convolutional Neural Networks are biologically-inspired variants of Multi Layered Perceptrons (MLPs)
- 14. History of CNNs • Research dates back to the 1970s • Seminal Paper on CNNs: – Gradient-based learning applied to document recognition, Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, 1998 • Really took off in 2012 – ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) – 2012 ILSVRC: AlexNet, Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton – 2013 ILSVRC: ZF Net, Matthew Zeiler and Rob Fergus, NYU – 2014: VGG Net, Karen Simonyan and Andrew Zisserman, University of Oxford
- 15. CNN Overview • A CNN typically consists of one or more convolutional and sampling layers followed by one or more fully connected layers. • Specifically designed to exploit 2D input such as an image or speech input • Faster to train than fully connected networks. • Sparse Connectivity – CNNs exploit spatially-local correlation using a local connectivity pattern between units of adjacent layers. – These are called local receptive fields • Shared Weights – Replicated units share the same parameterization (weight vector and bias) and form a feature map. • Max Pooling – A form of non-linear down-sampling. Max pooling partitions the input image into a set of non-overlapping rectangles and, for each such sub-region, outputs the maximum value.
- 16. Local Receptive Fields • In a fully connected network, every input in the input layer is connected to every hidden unit. • Such a network does not take the spatial structure of the image into account. • The idea is to map (connect) small rectangular sections of the image (inputs) to different hidden units. • The section of the input connected to a hidden unit is called that unit's local receptive field; this results in a sparse connectivity between the input layer and the first hidden layer. • The stride length is the amount by which we shift the rectangular sections. Typically the sections are shifted by 1 pixel. • Different sets of local receptive fields form feature maps, each of which represents a potentially different feature.
- 17. Feature Maps • Every hidden unit in a feature map shares the same set of weights and bias, but applies them to a different spatial area of the input. • This allows that layer to learn the same feature for different regions of the image. • The complete hidden layer will in fact consist of several feature maps. This is called a convolutional layer. • The shared bias and weights in each feature map are often called filters or kernels.
- 18. How Feature Maps Work The amount by which the local receptive field is shifted is called the stride length. A stride length of 1 is common. All hidden units in a feature map share the same weights and bias. This greatly reduces the number of parameters in a layer. Image credit: Michael Nielsen’s Neural Networks and Deep Learning, Chapter 6.
- 19. Why Do Feature Maps Learn Different Features? • From Quora: Andy Thomas • Two reasons: – The weights of the filters are randomly initialized – Different feature maps reduce the cost function • Random initialization of the weights will likely ensure each filter converges to different local minima in the cost function. It is very unlikely that each filter would begin to resemble other filters, as that would almost certainly result in an increase of the cost function and therefore no gradient descent algorithm would head in that direction. • Some feature maps may learn the same feature.
- 20. The Convolution Operator • A convolution is a simple mathematical operation common to many image processing operators. • Provides a way of “multiplying” two arrays of numbers of different sizes but the same dimensionality • If the input image has M rows and N columns, and the kernel has m rows and n columns, then the output image will have M - m + 1 rows and N - n + 1 columns. • The purpose of convolution in a CNN is to extract features from the input image. • Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data
- 21. Output of the Convolutional Layer • For each hidden unit in each feature map, only take into account pixels in the local receptive field (sparse connectivity) • For each feature map, for the (j, k)th hidden unit in that feature map, assuming a 5x5 filter (aka kernel), the output of that unit is given by: $\sigma\left(b + \sum_{l=0}^{4} \sum_{m=0}^{4} w_{l,m}\, a_{j+l,\,k+m}\right)$
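The output formula and the output-size rule (M - m + 1 rows, N - n + 1 columns) can be sketched as follows. This is a hedged illustration: the function names are assumptions, images and kernels are taken to be vectors of row vectors, and, as is usual in CNN code, the kernel is applied without flipping (strictly a cross-correlation).

```clojure
;; Illustrative sketch; names and data layout are assumptions.

(defn sigmoid [z] (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn unit-output
  "sigma(b + sum_l sum_m w[l][m] * a[j+l][k+m]) for the unit at row j, column k."
  [image kernel bias j k]
  (sigmoid
    (+ bias
       (reduce +
               (for [l (range (count kernel))
                     m (range (count (first kernel)))]
                 (* (get-in kernel [l m])
                    (get-in image [(+ j l) (+ k m)])))))))

(defn feature-map
  "Applies one shared kernel and bias at every position with stride 1.
   An M x N image and an m x n kernel give (M - m + 1) x (N - n + 1) outputs."
  [image kernel bias]
  (let [rows (inc (- (count image) (count kernel)))
        cols (inc (- (count (first image)) (count (first kernel))))]
    (mapv (fn [j]
            (mapv (fn [k] (unit-output image kernel bias j k))
                  (range cols)))
          (range rows))))
```

For a 28x28 image and a 5x5 kernel, `feature-map` returns a 24x24 matrix; a convolutional layer would hold one such kernel and bias per feature map.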
- 22. Pooling • A pooling layer typically follows a convolutional layer. • Intuitively it is a down-sampling of the previous layer. • Max pooling is a technique that selects the maximum activation from a set of units in the convolutional layer. • Effectively, take each feature map from the convolutional layer and produce a reduced feature map. • Other pooling techniques: – L2 Pooling • Takes the square root of the sum of the squares of a set of units
- 23. How Pooling Works • Pooling is a form of statistical aggregation or downsampling of the previous layer. • Pooling layers do not learn anything • While it is common, it is not required to have a pooling layer after a convolutional layer Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 6
- 24. Backpropagation in CNNs Overview • Applying backpropagation to a convolutional layer is very similar to applying backpropagation to a fully connected layer, except that errors and gradients are computed separately for each filter. • Applying backpropagation to a pooling layer involves using an upsampling function which propagates the error over the sampling function using its derivatives. • Backpropagation for a fully connected layer is exactly the same as for MLPs. • Yoshua Bengio on Quora: “There is a general recipe for obtaining a back-propagation algorithm associated with ANY computational graph. You can find it described in my book, for example, in the feedforward nets (mlp) chapter (6): DEEP LEARNING”
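Max pooling and L2 pooling over non-overlapping blocks can be sketched with one shared helper. This is an illustrative sketch only: the names `pool`, `max-pool` and `l2-pool` are assumptions, feature maps are taken to be vectors of row vectors, and rows or columns that do not fill a complete block are dropped.

```clojure
;; Illustrative sketch; names and data layout are assumptions.

(defn pool
  "Partitions `fmap` into non-overlapping size x size blocks and reduces
   each block to a single number with `agg`."
  [agg size fmap]
  (let [col-starts (range 0 (- (count (first fmap)) (dec size)) size)]
    (mapv (fn [rows]
            (mapv (fn [c] (agg (mapcat #(subvec % c (+ c size)) rows)))
                  col-starts))
          (partition size fmap))))

(defn max-pool
  "Keeps the maximum activation of each block."
  [size fmap]
  (pool #(apply max %) size fmap))

(defn l2-pool
  "Square root of the sum of the squares of each block."
  [size fmap]
  (pool #(Math/sqrt (reduce + (map * % %))) size fmap))

(max-pool 2 [[1 3 2 0]
             [4 2 1 1]
             [0 0 5 6]
             [1 2 7 8]])
;; => [[4 2] [2 8]]
```

Neither function has any weights, which is the sense in which pooling layers "do not learn anything": they only aggregate the activations produced by the preceding convolutional layer.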
- 25. Backpropagation in CNNs • Error and gradient for fully connected layers • Error and gradient for the convolutional layer • k indexes the filter number, and upsample propagates the error through the pooling layer [the equations themselves appear only as images on the slide]
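The equations on this slide did not survive as text. As a hedged reconstruction, the commonly used forms are given here; the notation is an assumption of this sketch and not taken from the slide: $J$ is the cost, $W^{(l)}$ and $b^{(l)}$ are the weights and bias of layer $l$, $z^{(l)}$ and $a^{(l)}$ are its pre-activations and activations, $f$ is the activation function, and $\odot$ is the element-wise product.

For a fully connected layer:

$$
\delta^{(l)} = \left( (W^{(l)})^{T} \delta^{(l+1)} \right) \odot f'(z^{(l)}), \qquad
\nabla_{W^{(l)}} J = \delta^{(l+1)} \, (a^{(l)})^{T}, \qquad
\nabla_{b^{(l)}} J = \delta^{(l+1)}
$$

For a convolutional layer followed by a pooling layer, with $k$ indexing the filter:

$$
\delta_k^{(l)} = \operatorname{upsample}\!\left( (W_k^{(l)})^{T} \delta_k^{(l+1)} \right) \odot f'(z_k^{(l)})
$$

Under this reading, upsample distributes the error of each pooled unit back over the units that fed it: uniformly for mean pooling, and entirely to the unit that held the maximum for max pooling.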
- 26. Slides from Hiroshi Kuwajima (visiting scholar at Stanford)
- 27. MNIST Data Set • National Institute of Standards and Technology (NIST) • Modified NIST Data Set maintained by Yann LeCun • MNIST Data in CSV format
- 28. A Simple Architecture for MNIST Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 6. • Input layer: 784 inputs encode the 28×28 MNIST image • Convolutional layer: 1728 units representing 3 feature maps • Max-Pooling layer: 432 units representing 3 feature maps • Output layer: 10 units, one for each digit
- 29. Shared Weights and Training CNNs • CNN – 28×28 = 784 input neurons – 20 feature maps, each with 5×5 = 25 shared weights plus 1 shared bias: 20×26 = 520 – Total of 520 parameters to learn. • MLP – 28×28 = 784 inputs – 30 hidden units – Total of 784×30 = 23,520 weights – Total of 30 biases – Total of 23,550 parameters to learn. • A single fully connected layer would have more than 40 times as many parameters as the convolutional layer.
- 30. A CNN Architecture for MNIST Image Credit: Michael Nielsen, Neural Networks and Deep Learning, Chapter 6 • 9,967 test images correctly classified out of 10,000 • Very similar to the LeNet-5 architecture • Softmax regression, aka multi-class logistic regression, is a generalization of logistic regression that is used for multi-class classification and is based on the softmax function.
- 31. Incorrectly Classified MNIST Images Of the 10,000 MNIST test images, 9,967 were correctly classified and 33 were incorrectly classified
- 32. What features are learned? • The images above show the type of features the convolutional layer learns. • Lighter regions mean a smaller, typically negative, weight • Darker regions mean a larger weight • Many of the features have distinguishable sub-regions of light and dark • It’s clear that it’s learning “stuff” related to spatial structure
- 33. Performance Enhancements • Regularization terms to help with overfitting – Regularization is a technique that adds a penalty term to your loss function. • Ensemble methods – Train several nets and have them vote on the output. • Generated expanded data sets – Basically apply distortions to the original data set – E.g. 50,000 images → 250,000 images
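The unit and parameter counts on the last two slides follow from simple arithmetic, sketched here. The 5x5 kernel and 2x2 pooling sizes are assumptions: they are not stated on the slides but are the values consistent with the quoted counts (3 × 24 × 24 = 1728 and 3 × 12 × 12 = 432). The function names are illustrative.

```clojure
;; Illustrative sketch; kernel and pooling sizes are assumptions (see above).

(defn layer-sizes
  "Unit counts for a square input, one stride-1 convolutional layer with
   n-maps feature maps, and a non-overlapping pooling layer."
  [input-side kernel-side pool-side n-maps]
  (let [conv-side (inc (- input-side kernel-side))
        pool-out  (quot conv-side pool-side)]
    {:inputs     (* input-side input-side)
     :conv-units (* n-maps conv-side conv-side)
     :pool-units (* n-maps pool-out pool-out)}))

(defn conv-params
  "Shared weights plus one shared bias per feature map."
  [kernel-side n-maps]
  (* n-maps (inc (* kernel-side kernel-side))))

(defn dense-params
  "One weight per input per unit, plus one bias per unit."
  [n-in n-out]
  (+ (* n-in n-out) n-out))

(layer-sizes 28 5 2 3)   ;; => {:inputs 784, :conv-units 1728, :pool-units 432}
(conv-params 5 20)       ;; => 520
(dense-params 784 30)    ;; => 23550
```

The ratio 23550 / 520 is roughly 45, which is where the "more than 40 times" comparison comes from.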
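The softmax function behind the output layer can be sketched as below. This is an illustration, not the talk's code; subtracting the maximum before exponentiating is a common numerical-stability choice and does not change the result.

```clojure
;; Illustrative sketch of the softmax function.

(defn softmax
  "Turns a sequence of real-valued scores into probabilities that sum to 1."
  [zs]
  (let [m     (apply max zs)
        exps  (map #(Math/exp (- % m)) zs)
        total (reduce + exps)]
    (mapv #(/ % total) exps)))

(softmax [0 0])   ;; => [0.5 0.5]
```

With 10 output units, one per digit, the predicted digit is the index of the largest softmax output.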
- 34. Expanded Generated Data Sets Image credit: Tijmen Tieleman, University of Toronto
- 35. CNN Summary • There are four main operations in a CNN: – Convolution – Non-Linearity (ReLU) – Pooling or Sub-Sampling – Classification (Fully Connected Layer) • These operations are the basic building blocks of every CNN. • CNNs are faster to train than MLPs because fewer parameters need to be learned. • Work well with two-dimensional data in which locality is meaningful, – e.g. object recognition in images. • CNNs can also be used with higher-dimensional data – e.g. MRI images • Additional convolutional layers provide higher-level features (meta features) • Pooling layers progressively reduce the spatial size of the representation to reduce the number of features and the computational complexity of the network • The fully connected layer at the end provides the classifier • Networks based on Rectified Linear Units (ReLU) typically outperform networks based on sigmoidal activation functions (logistic sigmoid or tanh).
- 36. Part 2 Recurrent Neural Nets (RNNs) Recurrent Neural Networks are a family of Neural Networks for processing sequential data.
- 37. Recurrent Neural Nets Overview • Leverage the ideas of – unfolding computational graphs – parameter sharing to abstract away input position • “In 2009 I visited Nepal” vs “I visited Nepal in 2009” • RNNs are cyclic graphs, so information can flow back into earlier parts of the network. – They are networks with loops in them, allowing information to persist. • Different flavors of RNNs – An output at each time step and recurrent connections between hidden units – An output at each time step and recurrent connections only from output units – An output only after the entire sequence is fed into the network, and connections between hidden units. • RNNs can simulate a Turing Machine and can represent any computable function – Siegelmann and Sontag, 1995. – Used an RNN of finite size consisting of 886 units
- 38. RNNs in Practice • Types of RNN used in Practice – Vanilla RNNs – Bidirectional RNNs – Deep Bidirectional RNNs – Long Short-Term Memory (LSTM) • Practical Applications of RNNs – Language Modeling And Generating Text – Machine Translation – Speech Recognition – Generating Image Descriptions
- 39. Computational Graphs • Computational Graph: Formalization of the structure of a set of computations. • Unfolding a recursive computation into a graph with repetitive structure results in parameter sharing across a deep network structure. • Essentially any function involving a recurrence can be considered an RNN • Hidden Units in RNN: – $h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)$ – Notice that θ is the same at each time step.
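Unfolding the recurrence over an input sequence while reusing the same θ at every step maps naturally onto `reductions`. This is a minimal sketch under stated assumptions: a single scalar hidden unit, tanh as the function f, and a parameter map with the hypothetical keys `:w-h`, `:w-x` and `:b`.

```clojure
;; Illustrative sketch; scalar hidden state and parameter names are assumptions.

(defn rnn-step
  "h(t) = tanh(w-h * h(t-1) + w-x * x(t) + b) for one scalar hidden unit."
  [{:keys [w-h w-x b]} h-prev x]
  (Math/tanh (+ (* w-h h-prev) (* w-x x) b)))

(defn unfold
  "Hidden states h(1) ... h(T) for the input sequence xs, using the same
   parameter map theta at every time step."
  [theta h0 xs]
  (rest (reductions (partial rnn-step theta) h0 xs)))

;; Three time steps, one shared parameter map.
(unfold {:w-h 0.5 :w-x 1.0 :b 0.0} 0.0 [1.0 0.0 -1.0])
```

Because `theta` is closed over once and applied at each step, the unfolded graph has as many copies of the step as there are inputs but only one set of parameters.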
- 40. Unfolding an RNN
- 41. Training RNNs • Backpropagation in Computational Graphs – Backpropagation can be derived for any computational graph by recursively applying the chain rule (Deep Learning, Chapter 6). – The backpropagation algorithm consists of performing a Jacobian-gradient product for each operation in the graph. – In vector calculus, the Jacobian matrix is the matrix of all first-order partial derivatives of a vector-valued function. • Backpropagation Through Time (BPTT) – The gradient at each output depends not only on the calculations of the current time step, but also on those of the previous time steps. – Vanilla RNNs trained with BPTT have difficulty learning long-term dependencies, i.e. dependencies between steps (words) that are far apart. • “I grew up in France… I speak fluent French” – BPTT suffers from the vanishing/exploding gradient problem. • Vanishing gradient: the gradients get smaller and smaller in magnitude as you backpropagate through earlier layers (or through time). • The derivative of the sigmoid function lies in (0, 0.25] and that of tanh in (0, 1], so multiplying many such factors easily causes the gradient to vanish in earlier layers. • Exploding gradient: more of an issue with recurrent networks, where the opposite happens because the repeatedly applied Jacobian has eigenvalues of magnitude greater than 1. – Certain types of RNNs (like LSTMs) were specifically designed to get around these problems.
- 42. Long Short Term Memory (LSTM) • LSTMs are a special kind of RNN, capable of learning long-term dependencies. • Successful in handwriting recognition, speech recognition, image captioning and machine translation • Type of gated network • Introduced by Hochreiter & Schmidhuber (1997) – Added self-loops which allow the gradient to flow for long durations. – Weight on the self-loop based on context rather than fixed (Gers et al., 2000). – Based on the idea of creating paths through the network in which the gradient neither vanishes nor explodes. • Leaky units allow information to accumulate over a long duration • LSTMs generalize leaky units by allowing connection weights to change over time. • LSTMs allow the network to decide when to forget information. • A single hidden unit in an LSTM is replaced with a recurrent network cell consisting of 4 components that interact with each other.
- 43. Gated Network Cells • Gated network cells replace the hidden units of RNNs • The input feature is computed using a regular ANN unit. • The input can be accumulated into the state if the input gate allows it. • The state has a self-loop controlled by the forget gate • The output can be turned off by the output gate
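The vanishing and exploding behaviour can be illustrated numerically in the scalar case, where the Jacobian of one time step reduces to the recurrent weight times the derivative of the activation. This is a deliberately simplified sketch: the function names are assumptions, and evaluating every step at the same pre-activation z is a simplification made only to keep the arithmetic visible.

```clojure
;; Illustrative sketch; scalar case with a fixed pre-activation (assumption).

(defn sigmoid [z] (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn sigmoid'
  "Derivative of the sigmoid; its maximum is 0.25, reached at z = 0."
  [z]
  (let [s (sigmoid z)] (* s (- 1.0 s))))

(defn gradient-factor
  "Magnitude of the factor picked up when backpropagating through `steps`
   scalar sigmoid units with recurrent weight w, each evaluated at z."
  [w z steps]
  (Math/pow (Math/abs (* w (sigmoid' z))) steps))

(gradient-factor 1.0 0.0 20)   ;; 0.25^20, roughly 9e-13: the gradient vanishes
(gradient-factor 8.0 0.0 20)   ;; 2.0^20 = 1048576.0: the gradient explodes
```

Even in the best case for the sigmoid (z = 0), a weight of magnitude 1 shrinks the gradient by a factor of 4 per step, so a dependency 20 steps back contributes almost nothing to the update.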
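The interaction of the input feature, the three gates and the self-looping state can be sketched for a scalar cell. This follows the commonly used LSTM formulation and is not taken from the talk: the parameter layout (a map from `:input`, `:forget`, `:output` and `:candidate` to `{:w-x :w-h :b}`) and the function names are assumptions.

```clojure
;; Illustrative sketch of a scalar LSTM cell; names and layout are assumptions.

(defn sigmoid [z] (/ 1.0 (+ 1.0 (Math/exp (- z)))))

(defn gate
  "Applies `act` to w-x * x + w-h * h-prev + b for one set of parameters."
  [act {:keys [w-x w-h b]} x h-prev]
  (act (+ (* w-x x) (* w-h h-prev) b)))

(defn lstm-step
  "One time step. `state` is {:c cell-state :h hidden-output}."
  [params {:keys [c h]} x]
  (let [i  (gate sigmoid (:input params) x h)            ; input gate
        f  (gate sigmoid (:forget params) x h)           ; forget gate
        o  (gate sigmoid (:output params) x h)           ; output gate
        g  (gate #(Math/tanh %) (:candidate params) x h) ; input feature
        c' (+ (* f c) (* i g))]                          ; self-loop on the state
    {:c c' :h (* o (Math/tanh c'))}))
```

A whole sequence can be processed with `(reduce (partial lstm-step params) {:c 0.0 :h 0.0} xs)`. When the forget gate is close to 1 and the input gate close to 0, the state `c` is carried forward almost unchanged, which is the path along which the gradient neither vanishes nor explodes.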
- 44. LSTM in NLP Generation Image credit: Google Research Blog
- 45. LSTM Summary • A type of RNN architecture that addresses the vanishing/exploding gradient problem. • LSTMs allow the learning of long-term dependencies, which is crucial for sequences of inputs. • Recently achieved state-of-the-art performance in speech recognition, language modeling, translation, and image captioning
- 46. Additional Topics… • Generative Adversarial Networks (GANs) • Deep Reinforcement Learning (DRL) • Differentiable Neural Computers (DNCs)
- 47. Part 3 Generative Adversarial Networks (GANs) Generative Adversarial Networks are an example of generative models. GANs focus primarily on sample generation, though it is possible to design GANs that can estimate the probability distribution.
- 48. GAN Framework • Based on the idea of a two-player game – Player 1: Generator – Player 2: Discriminator • The generator generates samples and tries to fool the discriminator • The discriminator tries to determine whether a given sample is real (from the training data) or fake (produced by the generator)
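The two-player game is usually written as a minimax problem over a value function. The slide does not show it, so the notation here is an assumption taken from the standard GAN formulation: $G$ is the generator, $D$ the discriminator's estimated probability that its input is real, $p_{\text{data}}$ the data distribution and $p_z$ the noise distribution fed to the generator.

$$
\min_G \max_D V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] +
\mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
$$

The discriminator maximizes $V$ by assigning high probability to real samples and low probability to generated ones, while the generator minimizes $V$ by producing samples $G(z)$ that the discriminator scores as real.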
- 49. Why GANs are useful • When predicting the next frame in a video, using the Mean Squared Error (MSE) causes an averaging over many possible futures which causes the ear to disappear and blurring of the eyes • The adversarial version does a much better job preserving the ear and not blurring the eyes. Image credit: Ian Goodfellow, GANs Tutorial, NIPS 2016
- 50. GANs Summary • GANs are generative models that use supervised learning to approximate an intractable cost function • GANs require finding Nash equilibria in high-dimensional, continuous, non-convex games. • GANs are crucial to many different state-of-the-art image generation and manipulation systems.
- 51. Part 4 Deep Reinforcement Learning (DRL) Deep Reinforcement Learning combines both Deep Learning and Reinforcement Learning by using Deep Learning techniques to learn values for the Q Function in Reinforcement Learning. This is described in Google DeepMind’s Atari paper and exemplified by the AlphaGo program.
- 52. Deep Reinforcement Learning • Combines Reinforcement Learning with Deep Learning • A form of model-free or unsupervised learning • Uses Neural Nets to estimate Q values. • Very new field. No Wikipedia page on this topic. • The idea is to feed states and actions into the network to predict Q values. • Neural networks are exceptionally good at coming up with good features for highly structured data. • This is the technology used by Google DeepMind’s AlphaGo program.
- 53. Reinforcement Learning Revisited • Definitions – Policy π is a way of selecting an action given a state – Value function Qπ (s,a) is the expected total reward for performing action a from state s given policy π • Different Approaches – Policy Based RL • Search for the optimal policy in space of policies – Value-based RL • Estimate optimal value function Q*(s,a) – Model-based RL • Build a model of the environment and use look ahead
- 54. The Many States Problem • In the Nature DeepMind Atari paper: • Take the last four screen images, resize them to 84×84 and convert them to gray scale with 256 gray levels. • This yields $256^{84 \times 84 \times 4} \approx 10^{67970}$ possible game states. • This means $10^{67970}$ rows in our imaginary Q-table. • That is far more than the number of atoms in the known universe!
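The exponent can be checked by working in log space, since the number itself is far too large to build directly. A small sketch; the function name is an assumption.

```clojure
;; Illustrative sketch; computes log10 of levels^(width * height * frames).

(defn log10-states
  "log10 of the number of distinct inputs made of `frames` images of
   width x height pixels with `levels` gray levels each."
  [levels width height frames]
  (* width height frames (Math/log10 levels)))

(log10-states 256 84 84 4)   ;; a little over 67970, i.e. about 10^67970 states
```

There are 84 × 84 × 4 = 28,224 pixels in the stacked input, each contributing log10(256) ≈ 2.408 to the exponent.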
- 56. Deep Q-Learning Error & Gradient • Represent Q function using a deep network. • Error function • Gradient
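The error function and gradient on this slide did not survive as text. A hedged reconstruction in the usual deep Q-learning form is given here; the symbols are assumptions of this sketch: $w$ are the network weights, $\gamma$ is the discount factor, and the target $y$ is treated as a constant when differentiating.

$$
y = r + \gamma \max_{a'} Q(s', a'; w), \qquad
L(w) = \tfrac{1}{2}\, \mathbb{E}\left[\left(y - Q(s, a; w)\right)^{2}\right]
$$

$$
\nabla_w L(w) = -\,\mathbb{E}\left[\left(y - Q(s, a; w)\right) \nabla_w Q(s, a; w)\right]
$$

A gradient-descent step therefore moves $Q(s, a; w)$ towards the target $y$ in proportion to the current error, with the expectation taken over sampled transitions $\langle s, a, r, s' \rangle$.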
- 57. Strategies & Tricks • Experience Replay – During gameplay all the experiences <s,a,r,s′> are stored in a replay memory. – When training the network, random samples from the replay memory are used instead of the most recent transition. – This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum. – Experience replay also makes the training task more similar to usual supervised learning, which simplifies debugging and testing the algorithm. – One could actually collect all those experiences from human gameplay and then train the network on these. • Exploration-Exploitation – ε-greedy exploration: with probability ε choose a random action, otherwise go with the “greedy” action with the highest Q-value.
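Both tricks are small enough to sketch directly. This is an illustration under stated assumptions, not the DeepMind implementation: the replay memory is a plain vector with a hypothetical fixed capacity, transitions are vectors `[s a r s']`, and Q-values are given as a map from action to estimated value.

```clojure
;; Illustrative sketch; data layout and names are assumptions.

(defn remember
  "Adds a transition [s a r s'] to the replay memory (a vector), keeping
   at most `capacity` of the most recent transitions."
  [memory capacity transition]
  (let [memory (conj memory transition)]
    (if (> (count memory) capacity)
      (subvec memory 1)
      memory)))

(defn sample-batch
  "Draws n transitions uniformly at random (with replacement) from memory."
  [memory n]
  (vec (repeatedly n #(rand-nth memory))))

(defn epsilon-greedy
  "With probability epsilon picks a random action, otherwise the action
   with the highest Q-value. `q-values` maps each action to its Q-value."
  [q-values epsilon]
  (if (< (rand) epsilon)
    (rand-nth (keys q-values))
    (key (apply max-key val q-values))))

;; Always greedy when epsilon is 0.
(epsilon-greedy {:left 0.1 :right 0.9 :fire 0.3} 0.0)
;; => :right
```

In practice ε is typically started high and decreased over time, so that the agent explores early on and exploits its learned Q-values later.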
- 59. DeepMind Atari Deep-Q Network
- 60. References (1) • Neural Nets & Deep Learning – http://neuralnetworksanddeeplearning.com/chap2.html – http://deeplearning.net/tutorial/deeplearning.pdf • Convolutional Neural Networks – http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf – http://neuralnetworksanddeeplearning.com/chap6.html – http://cs231n.github.io/convolutional-networks/ – Visualizing and Understanding Convolutional Networks – Convolutional Neural Networks backpropagation: from intuition to derivation – An Intuitive Explanation of Convolutional Neural Networks – Backpropagation in Convolutional Neural Networks • Recurrent Neural Nets – http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf – http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ – http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/ – http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients/ – http://www.wildml.com/2015/10/recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/
- 61. References (2) • Generative Adversarial Networks – NIPS 2016 Tutorial: Generative Adversarial Networks • Deep Reinforcement Learning – http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/deep_rl.pdf – http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/ • Differentiable Neural Computers – https://deepmind.com/blog/differentiable-neural-computers/ • Google DeepMind DRL Atari Paper – https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf
- 62. Questions • Goodfellow quote on BP on Quora • Vanishing / exploding gradient