The document discusses using convolutional neural networks (CNNs) for text classification. It presents two CNN architectures: a character-level CNN that takes raw text as input, and a word-level CNN that uses word embeddings. The word-level CNN achieved 85% accuracy on a product categorization task and was faster to train and run than either the character-level CNN or traditional SVMs. The document concludes that word-level CNNs are a promising approach to text classification, achieving high accuracy with minimal tuning.
6. Traditional vs Deep Learning Approach
Traditional Approach:
• Well understood; more than two decades of active research
• Successfully used in many applications
• More steps, with several choices at each step, but the right choices are well established
• Major time is spent on feature engineering
• Easy to serve the model in real time

Deep Learning Approach:
• Nascent; started around 2014-2015
• Fewer steps
• Major time is spent on parameter tuning
• Real-time serving of the model can be challenging

It is hard to beat the accuracy of the traditional approach in text classification!!!
11. Character-level CNN: Characteristics
| Layer | Filter | Subsample | Output shape | Activation | Param # |
| Input | - | - | 70 x 1014 | - | - |
| Convolution1 | 256@70 x 7 | 1 x 3 | 1 x 336 x 256 | ReLU | 125,696 |
| Convolution2 | 256@1 x 7 | 1 x 3 | 1 x 110 x 256 | ReLU | 2,048 |
| Convolution3 | 256@1 x 3 | - | 1 x 108 x 256 | ReLU | 1,024 |
| Convolution4 | 256@1 x 3 | - | 1 x 106 x 256 | ReLU | 1,024 |
| Convolution5 | 256@1 x 3 | - | 1 x 104 x 256 | ReLU | 1,024 |
| Convolution6 | 256@1 x 3 | 1 x 3 | 1 x 34 x 256 | ReLU | 1,024 |
| Flatten | - | - | 8704 | - | - |
| FC1 | - | - | 1024 | - | 8,913,920 |
| FC2 | - | - | 1024 | - | 1,049,600 |
| FC3 | - | - | 380 | - | 389,500 |
| Total | | | | | 10,484,860 |
• Slow to train
• Slow during inference: more than 100 milliseconds per example on a P100 GPU
• Achieves 79% accuracy on the test set
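The output shapes in the table can be sanity-checked with a small helper, assuming valid (no-padding) convolutions and non-overlapping subsampling of width 3:

```python
# Sanity-check the character-level CNN output lengths from the table above.
# Assumes "valid" (no-padding) convolutions and non-overlapping pooling.

def conv_out(length, filter_width, subsample=1):
    """Sequence length after a valid 1-D convolution plus optional pooling."""
    length = length - filter_width + 1
    return length // subsample

length = 1014                    # input: 70 x 1014 one-hot characters
length = conv_out(length, 7, 3)  # Convolution1 -> 336
length = conv_out(length, 7, 3)  # Convolution2 -> 110
length = conv_out(length, 3)     # Convolution3 -> 108
length = conv_out(length, 3)     # Convolution4 -> 106
length = conv_out(length, 3)     # Convolution5 -> 104
length = conv_out(length, 3, 3)  # Convolution6 -> 34

print(length)        # 34
print(length * 256)  # 8704 units fed into FC1
```

The flattened size of 34 x 256 = 8704 matches the input of FC1 in the table.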
12. Word-level CNN
• Input text is represented as an n x k matrix using word embeddings
• n is the maximum number of words in the text; input is padded or truncated as necessary
• k is the length of the embedding
• Apply multiple convolutions of width k and different heights fi
• The height of each filter's output is (n - fi + 1)
• Apply max-pooling across the (n - fi + 1) outputs to select one output per filter
• Intuitively, this detects the presence of a feature anywhere in the text
• The n x k representation can be learned as part of the network, or pre-trained word embeddings can be used
Figure from https://arxiv.org/pdf/1408.5882.pdf
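The convolution-plus-max-pooling step above can be sketched in NumPy for a single filter (the sizes and random values here are illustrative; a real model learns W and uses many filters per height):

```python
import numpy as np

n, k, f = 25, 128, 3  # words, embedding size, filter height (illustrative)
rng = np.random.default_rng(0)

X = rng.standard_normal((n, k))  # n x k matrix of word embeddings
W = rng.standard_normal((f, k))  # one filter of height f and full width k

# Valid convolution down the text: one score per window of f consecutive words
scores = np.array([np.sum(X[i:i + f] * W) for i in range(n - f + 1)])
activations = np.maximum(scores, 0.0)  # ReLU

# Max-over-time pooling: keep this filter's strongest response anywhere in the text
feature = activations.max()
print(scores.shape)  # (23,) = (n - f + 1,)
```

Because the filter spans the full embedding width k, each filter slides only vertically over the text, and max-pooling reduces its (n - f + 1) responses to a single value.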
13. Word-level CNN: Our implementation
One-hot encoding (n x v)
-> Embedding (v x k lookup table): n x k
-> Convolution: p filters of each size f1 x k, f2 x k, f3 x k
-> Filter outputs: (n - f1 + 1) x p, (n - f2 + 1) x p, (n - f3 + 1) x p
-> Max pooling: 1 x p for each filter size
-> Fully connected: 1 x o
| Parameter | Setting |
| Sentence length | n = 25 |
| Vocabulary size | v = 500K |
| Embedding size | k = 128 |
| Filter heights | f1, f2, f3 = 2, 3, 4 |
| Filters per height | p = 128 |
| Output classes | o = 380 |
Total number of parameters:
• Embedding: v x k = 64M
• Convolution filters: (f1 + f2 + f3) x k x p = 147K
• Fully connected: (3 x p) x o = 145K
• Total = 64,293,376
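These counts can be reproduced directly from the settings above (bias terms appear to be excluded, which matches the stated total):

```python
# Reproduce the word-level CNN parameter counts (bias terms excluded).
v, k, p, o = 500_000, 128, 128, 380
filter_heights = (2, 3, 4)  # f1, f2, f3

embedding = v * k                    # 64,000,000
conv = sum(filter_heights) * k * p   # 147,456 (~147K)
fc = len(filter_heights) * p * o     # 145,920 (~146K), 3 pooled vectors of size p

print(embedding + conv + fc)  # 64,293,376
```

Note that the embedding table dominates: over 99% of the parameters sit in the v x k lookup table.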
14. Convolution output
| Phrase | Weight |
| sensitive skin moisturizing cream | 3.814296 |
| dry sensitive skin moisturizing | 2.8061242 |
| cream 16.0 oz END_TOKEN | 2.5697758 |
| skin moisturizing cream 16.0 | 2.3056493 |
| moisturizing cream 16.0 | 2.1790688 |

Tokens around "moisturizing cream" are weighted highly, categorizing the item under "Personal Care / Bath & Body".

| Phrase | Weight |
| fairytale dress sandal END_TOKEN | 4.5367112 |
| dress sandal END_TOKEN | 3.122334 |
| fairytale dress sandal END_TOKEN | 2.9044547 |
| mojo moxy | 2.8222353 |
| dress sandal END_TOKEN END_TOKEN | 2.6823337 |

Tokens around "dress sandal" are weighted highly, categorizing the item under "Clothing/Shoes". The brand "mojo moxy", which makes shoes, also received a high weight.
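Phrase weights like these can be read off a trained model by mapping each filter's activations back to the n-gram windows that produced them. A minimal sketch, with random numbers standing in for real convolution activations:

```python
import random

# Illustrative product title; random stand-ins for trained conv activations.
random.seed(0)
tokens = "dry sensitive skin moisturizing cream 16.0 oz END_TOKEN".split()
f = 4  # filter height: each activation covers a window of 4 tokens

activations = [random.random() for _ in range(len(tokens) - f + 1)]

# Rank every f-token window by its activation, highest first
ranked = sorted(
    ((" ".join(tokens[i:i + f]), a) for i, a in enumerate(activations)),
    key=lambda pair: -pair[1],
)
for phrase, weight in ranked[:3]:
    print(f"{phrase}\t{weight:.4f}")
```

With a trained network, the top-ranked windows for a given filter are exactly the kind of phrase tables shown above.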
17. Parameter Tuning
| Method | Accuracy |
| Baseline | 85.20% |
| More filters, of sizes [2, 3, 4, 5, 6] | 85.50% |
| Dropout probability increased from 0.5 to 0.75 | 85.97% |
| Batch size 2048 instead of 512 | 84.91% |
| Batch size 64 instead of 512 | 79.00% |
18. Scaling
Training time in minutes for one epoch over tens of millions of product titles:

| Processor | Word-CNN | Char-CNN |
| P100 | 112 | 395 |
| K80 | 209 | 662 |
| Intel Xeon 1.8 GHz, 8 cores | 301 | 8000 |

Inference time in milliseconds for one example:

| Word-CNN | Char-CNN |
| 4-8 milliseconds | >100 milliseconds |

Inference can be done on a CPU in a few milliseconds!!!
21. Conclusion
• Word-CNN is better and faster than Character-CNN
• Tokenization (i.e., some feature engineering) is still important, even with a DNN
• Word-CNN is a very promising network for text classification
• Very robust: easy to achieve good accuracy with very little parameter tuning
• Can be trained in a few hours on a CPU on tens of millions of examples
• Inference takes only a few milliseconds, even on a CPU
• Can be deployed to do inference (scoring) in real time
• It is promising to see a CNN achieve state-of-the-art accuracy on a very well studied problem with very little effort
• And the field is rapidly making progress
• Hopefully much higher accuracy soon!!!