The document discusses using convolutional neural networks (CNNs) for text classification. It presents two CNN architectures: a character-level CNN that takes raw text as input, and a word-level CNN that uses word embeddings. The word-level CNN achieved 85% accuracy on a product categorization task and was faster to train and run than either the character-level CNN or traditional SVMs. The document concludes that word-level CNNs are a promising approach to text classification, achieving high accuracy with minimal tuning.
6. Traditional vs Deep Learning Approach
Traditional Approach:
• Well understood; more than two decades of active research
• Successfully used in many applications
• More steps, with several choices at each step, but the right choices are well established
• Major time is spent on feature engineering
• Easy to serve the model in real time

Deep Learning Approach:
• Nascent; started around 2014-2015
• Fewer steps
• Major time is spent on parameter tuning
• Real-time serving of the model can be challenging

It is hard to beat the accuracy of the traditional approach in text classification!!!
11. Character-level CNN: Characteristics
| Layer | Filter | Subsample | Output shape | Activation | Param # |
| Input | - | - | 70 x 1014 | - | - |
| Convolution1 | 256@70 x 7 | 1 x 3 | 1 x 336 x 256 | ReLU | 125,696 |
| Convolution2 | 256@1 x 7 | 1 x 3 | 1 x 110 x 256 | ReLU | 2,048 |
| Convolution3 | 256@1 x 3 | - | 1 x 108 x 256 | ReLU | 1,024 |
| Convolution4 | 256@1 x 3 | - | 1 x 106 x 256 | ReLU | 1,024 |
| Convolution5 | 256@1 x 3 | - | 1 x 104 x 256 | ReLU | 1,024 |
| Convolution6 | 256@1 x 3 | 1 x 3 | 1 x 34 x 256 | ReLU | 1,024 |
| Flatten | - | - | 8704 | - | - |
| FC1 | - | - | 1024 | - | 8,913,920 |
| FC2 | - | - | 1024 | - | 1,049,600 |
| FC3 | - | - | 380 | - | 389,500 |
| Total | | | | | 10,484,860 |
• Slow to train
• Slow during inference: more than 100 milliseconds per example on a P100 GPU
• Achieves 79% accuracy on the test set
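The output shapes in the table can be sanity-checked with a small helper, assuming valid (no-padding) convolutions and non-overlapping subsampling of width 3:

```python
# Sanity-check the character-level CNN output lengths from the table above.
# Assumes "valid" (no-padding) convolutions and non-overlapping pooling.

def conv_out(length, filter_width, subsample=1):
    """Sequence length after a valid 1-D convolution plus optional pooling."""
    length = length - filter_width + 1
    return length // subsample

length = 1014                    # input: 70 x 1014 one-hot characters
length = conv_out(length, 7, 3)  # Convolution1 -> 336
length = conv_out(length, 7, 3)  # Convolution2 -> 110
length = conv_out(length, 3)     # Convolution3 -> 108
length = conv_out(length, 3)     # Convolution4 -> 106
length = conv_out(length, 3)     # Convolution5 -> 104
length = conv_out(length, 3, 3)  # Convolution6 -> 34

print(length)        # 34
print(length * 256)  # 8704 units fed into FC1
```

The flattened size of 34 x 256 = 8704 matches the input of FC1 in the table.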
12. Word-level CNN
• Input text is represented as an n x k matrix using word embeddings
• n is the maximum number of words in the text; input is padded or truncated as necessary
• k is the length of the embedding
• Apply multiple convolutions of width k and different heights fi
• The height of each filter's output is (n - fi + 1)
• Apply max-pooling across the (n - fi + 1) outputs to select one output per filter
• Intuitively, this detects the presence of a feature anywhere in the text
• The n x k representation can be learned as part of the network, or pre-trained word embeddings can be used
Figure from https://arxiv.org/pdf/1408.5882.pdf
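The convolution-plus-max-pooling step above can be sketched in NumPy for a single filter (the sizes and random values here are illustrative; a real model learns W and uses many filters per height):

```python
import numpy as np

n, k, f = 25, 128, 3  # words, embedding size, filter height (illustrative)
rng = np.random.default_rng(0)

X = rng.standard_normal((n, k))  # n x k matrix of word embeddings
W = rng.standard_normal((f, k))  # one filter of height f and full width k

# Valid convolution down the text: one score per window of f consecutive words
scores = np.array([np.sum(X[i:i + f] * W) for i in range(n - f + 1)])
activations = np.maximum(scores, 0.0)  # ReLU

# Max-over-time pooling: keep this filter's strongest response anywhere in the text
feature = activations.max()
print(scores.shape)  # (23,) = (n - f + 1,)
```

Because the filter spans the full embedding width k, each filter slides only vertically over the text, and max-pooling reduces its (n - f + 1) responses to a single value.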
13. Word-level CNN: Our implementation
One-hot encoding (n x v)
-> Embedding (v x k lookup table): n x k
-> Convolution: p filters of each size f1 x k, f2 x k, f3 x k
-> Filter outputs: (n - f1 + 1) x p, (n - f2 + 1) x p, (n - f3 + 1) x p
-> Max pooling: 1 x p for each filter size
-> Fully connected: 1 x o
| Parameter | Setting |
| Sentence length | n = 25 |
| Vocabulary size | v = 500K |
| Embedding size | k = 128 |
| Filter heights | f1, f2, f3 = 2, 3, 4 |
| Filters per height | p = 128 |
| Output classes | o = 380 |
Total number of parameters:
• Embedding: v x k = 64M
• Convolution filters: (f1 + f2 + f3) x k x p = 147K
• Fully connected: (3 x p) x o = 145K
• Total = 64,293,376
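These counts can be reproduced directly from the settings above (bias terms appear to be excluded, which matches the stated total):

```python
# Reproduce the word-level CNN parameter counts (bias terms excluded).
v, k, p, o = 500_000, 128, 128, 380
filter_heights = (2, 3, 4)  # f1, f2, f3

embedding = v * k                    # 64,000,000
conv = sum(filter_heights) * k * p   # 147,456 (~147K)
fc = len(filter_heights) * p * o     # 145,920 (~146K), 3 pooled vectors of size p

print(embedding + conv + fc)  # 64,293,376
```

Note that the embedding table dominates: over 99% of the parameters sit in the v x k lookup table.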
14. Convolution output
| Phrase | Weight |
| sensitive skin moisturizing cream | 3.814296 |
| dry sensitive skin moisturizing | 2.8061242 |
| cream 16.0 oz END_TOKEN | 2.5697758 |
| skin moisturizing cream 16.0 | 2.3056493 |
| moisturizing cream 16.0 | 2.1790688 |

Tokens around "moisturizing cream" are weighted highly, categorizing the item under "Personal Care / Bath & Body".

| Phrase | Weight |
| fairytale dress sandal END_TOKEN | 4.5367112 |
| dress sandal END_TOKEN | 3.122334 |
| fairytale dress sandal END_TOKEN | 2.9044547 |
| mojo moxy | 2.8222353 |
| dress sandal END_TOKEN END_TOKEN | 2.6823337 |

Tokens around "dress sandal" are weighted highly, categorizing the item under "Clothing/Shoes". The brand "mojo moxy", which makes shoes, also received a high weight.
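Phrase weights like these can be read off a trained model by mapping each filter's activations back to the n-gram windows that produced them. A minimal sketch, with random numbers standing in for real convolution activations:

```python
import random

# Illustrative product title; random stand-ins for trained conv activations.
random.seed(0)
tokens = "dry sensitive skin moisturizing cream 16.0 oz END_TOKEN".split()
f = 4  # filter height: each activation covers a window of 4 tokens

activations = [random.random() for _ in range(len(tokens) - f + 1)]

# Rank every f-token window by its activation, highest first
ranked = sorted(
    ((" ".join(tokens[i:i + f]), a) for i, a in enumerate(activations)),
    key=lambda pair: -pair[1],
)
for phrase, weight in ranked[:3]:
    print(f"{phrase}\t{weight:.4f}")
```

With a trained network, the top-ranked windows for a given filter are exactly the kind of phrase tables shown above.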
17. Parameter Tuning
| Method | Accuracy |
| Baseline | 85.20% |
| More filters, of sizes [2, 3, 4, 5, 6] | 85.50% |
| Dropout probability increased from 0.5 to 0.75 | 85.97% |
| Batch size 2048 instead of 512 | 84.91% |
| Batch size 64 instead of 512 | 79.00% |
18. Scaling
Training time in minutes for one epoch over tens of millions of product titles:

| Processor | Word-CNN | Char-CNN |
| P100 | 112 | 395 |
| K80 | 209 | 662 |
| Intel Xeon 1.8 GHz, 8 cores | 301 | 8000 |

Inference time in milliseconds for one example:

| Word-CNN | Char-CNN |
| 4-8 milliseconds | >100 milliseconds |

Inference can be done on a CPU in a few milliseconds!!!
21. Conclusion
• Word-CNN is better and faster than Character-CNN
• Tokenization (i.e., some feature engineering) is still important, even with a DNN
• Word-CNN is a very promising network for text classification
• Very robust: easy to achieve good accuracy with very little parameter tuning
• Can be trained in a few hours on a CPU on tens of millions of examples
• Inference takes only a few milliseconds, even on a CPU
• Can be deployed to do inference (scoring) in real time
• It is promising to see a CNN achieve state-of-the-art accuracy on a very well studied problem with very little effort
• And the field is rapidly making progress
• Hopefully much higher accuracy soon!!!