This is the content for the 4th International Conference on Data Management, Analytics and Innovation - ICDMAI 2020.
Here we discuss LSTMs and text classification for multi-class & multi-label problem statements.
Code : https://github.com/amitbcp/icdmai_2020
2. About Us
Abzooba is an Artificial Intelligence (AI) company. We partner with enterprises in their cognitive journey to augment digital transformation.
• Headquartered in Silicon Valley (USA), with delivery centres in Kolkata & Pune (India)
• 200+ employees
• We use xpresso.ai, a set of internal accelerators and toolkits, to deliver custom-built AI and ML solutions to our customers
3. Service Offerings Overview
Big Data and Cloud: collecting and processing of data for ingestion
• Lightning-fast data processing and real-time insights through distributed processing
• Structured and unstructured data ingestion through a scalable data lake
• Expertise in state-of-the-art Big Data and data-preprocessing tools
• Ability to process large volumes of data using parallel processing
Data Science: Deep Learning, Computer Vision and NLP to solve business problems
• Supervised and unsupervised ML algorithms to solve industry-specific use cases
• Creating an efficient ecosystem through natural language understanding
• Deep learning-based computer vision algorithms to gather insights from images
AI Ops: building enterprise-class infrastructure for seamless AI integration
• Deployment of AI solutions in production using a DevOps framework
• Integrated Development, Deployment and Management infrastructure
4. Index
1. Problem Statement
2. Naïve Solution
3. Word Embedding
4. Recurrent Neural Network
5. Ensemble & Evaluation
6. Hands On
6. A Recurrent Neural Pipeline
Build a recommendation system that can recommend technology domains & tags for questions posted on Stack Overflow.
Given a question on Stack Overflow, predict the technology domain & associated tags for it.
7. Multi Class | Multi Label
Document
• Politics: Election, Budget, Ban
• Entertainment: Concert, Movies
• Games: Player, Selection
8. Classification into multiple classes which are independent, and labels which are not mutually exclusive, arranged in a hierarchical manner.
Question
• Dev Ops: Jenkins, AWS, Docker
• QA: Testing, Selenium
• Big Data: Hadoop, Apache Spark
• Haskell
11. Exploratory Data Analysis
Understand Data Sourcing
• How was the data collected?
• Was the data-collection technique biased?
• Is this the complete original data or only a subset?
• Is any bias in the data due to the problem itself or not?
Understand Data Behavior
• Word/character distribution
• Time-based distribution
• Class/label distribution
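As a rough illustration of the distribution checks above, here is a minimal pandas sketch; the file name and the `question` / `tags` column names are assumptions for illustration, not taken from the talk or repository.

```python
import pandas as pd

# Hypothetical EDA sketch: assumes a CSV with question text and pipe-separated tags.
df = pd.read_csv("stackoverflow_questions.csv")  # assumed file name

# Word / character distribution
df["n_words"] = df["question"].str.split().str.len()
df["n_chars"] = df["question"].str.len()
print(df[["n_words", "n_chars"]].describe())

# Class / label distribution (multi-label: one question can carry several tags)
tag_counts = df["tags"].str.split("|").explode().value_counts()
print(tag_counts.head(20))
```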
12. EDA Summary
Group Group Name
1 Programming
2 MS-Development Environment
3 Server-Side Development
4 Mobile App Development
5 Dev Environment
6 Front-end/Designing
7 Dynamic UI
8 MVC
9 Dev Ops
10 Big Data
11 QA
12 Project Management
13 Scripting
14 Business Analytics
14. Machine Learning with Bag of Words
Create a master vocabulary
• Of all the words present in the documents
Replace words with numbers
• Based on their index in the vocabulary
Feed to the ML model
• Document-label pairs are fed to the model
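A bare-bones sketch of these three steps in plain Python; the toy documents and tags below are invented for illustration.

```python
# Step 1: create a master vocabulary from all words in the documents.
docs = ["how to deploy docker on aws", "docker build fails on jenkins"]  # toy documents
vocab = {word: idx for idx, word in enumerate(sorted({w for d in docs for w in d.split()}))}

# Step 2: replace words with their index in the vocabulary.
encoded = [[vocab[w] for w in d.split()] for d in docs]

# Step 3: feed document-label pairs to an ML model (labels here are invented).
labels = [["devops", "aws"], ["devops", "qa"]]
training_pairs = list(zip(encoded, labels))
print(vocab, encoded, training_pairs, sep="\n")
```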
15. Replace Words with Numbers
• Based on their index in the vocabulary
Notes
• Frequency-count based.
• Term frequency-inverse document frequency (TF-IDF).
• The sequence of words is not preserved.
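These two weighting schemes map directly onto scikit-learn's CountVectorizer and TfidfVectorizer; a minimal sketch (the toy documents are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["spark streaming on hadoop", "selenium tests for the web app"]  # toy documents

# Frequency-count based representation
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
print(count_vec.vocabulary_)   # word -> column index; the word order in the text is lost

# TF-IDF weighted representation
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)
print(X_tfidf.toarray())
```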
16. Feed to the ML Model
• Document-label pairs are fed to the model
We use a One-vs-Rest strategy to learn a classifier for each tag.
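One-vs-Rest for multi-label tags can be sketched with scikit-learn as below; the tiny dataset is invented for illustration, and the actual pipeline in the linked repository may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

docs = ["deploy docker image to aws", "run selenium tests in jenkins"]  # toy questions
tags = [["devops", "aws", "docker"], ["qa", "selenium"]]                # toy tag sets

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)  # one binary column per tag

# One binary classifier per tag, trained on TF-IDF features.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
model.fit(docs, Y)

pred = model.predict(["selenium test fails on docker"])
print(mlb.inverse_transform(pred))  # tags whose binary classifier fired
```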
19. What are Word Embeddings?
Word embeddings are feature-learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers.
Frequency-based embeddings
• Count vector
• TF-IDF vector
• Co-occurrence vector
Prediction-based embeddings
• CBOW (Continuous Bag of Words)
• Skip-gram model
• Transformer architectures are leading the research front
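A minimal prediction-based embedding sketch using gensim's Word2Vec (assuming gensim 4.x; the toy corpus is invented, a real setup would use the Stack Overflow question text):

```python
from gensim.models import Word2Vec

# Toy tokenised corpus
sentences = [
    ["docker", "image", "fails", "to", "build"],
    ["spark", "job", "runs", "on", "hadoop", "cluster"],
]

# sg=0 -> CBOW (predict a word from its context); sg=1 -> Skip-gram (predict context from a word).
cbow = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(cbow.wv["docker"].shape)            # 50-dimensional real-valued vector
print(skipgram.wv.most_similar("spark"))  # nearest words in embedding space
```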
20. Why do we need them?
• ML algorithms and almost all deep learning architectures are incapable of processing strings or plain text in their raw form.
• Embeddings represent human-understandable language as numeric codes a machine can operate on.
• Better representation leads to better machine understanding.
29. Parameters
1. word_embeddings: the embeddings that we want to stack
2. hidden_size: number of hidden units in the RNN state rolled over time
3. rnn_layers: number of RNN layers to stack; decides how deep the network is
4. bidirectional: whether the model reads the sequence left-to-right, right-to-left, or both
5. reproject_words: whether to re-project (fine-tune) the word embeddings through a trainable layer
6. reproject_words_dimension: embedding dimension after re-projection
7. dropout: dropout to be used
8. rnn_type: type of RNN cell to be used (e.g. LSTM, GRU)
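These parameter names match the Flair library's DocumentRNNEmbeddings; assuming that is the intended API, a minimal sketch (the GloVe embedding choice is illustrative):

```python
from flair.embeddings import WordEmbeddings, DocumentRNNEmbeddings

# Word-level embeddings that the document RNN rolls over.
glove = WordEmbeddings("glove")

document_embeddings = DocumentRNNEmbeddings(
    [glove],                        # word_embeddings: the embeddings we want to stack
    hidden_size=64,                 # size of the RNN hidden state
    rnn_layers=2,                   # how deep the network is
    bidirectional=True,             # read the sequence in both directions
    reproject_words=True,           # re-project (fine-tune) the word embeddings
    reproject_words_dimension=256,  # embedding dimension after re-projection
    dropout=0.5,
    rnn_type="LSTM",
)
```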
30. Challenges with RNNs
• Sensitive to learning rate & batch size
• Training loss can be unstable
• Hard to train
• Early stopping may leave an under-trained model
• Needs a high patience level
• Needs to be trained for longer
• Sequence-to-sequence models work better than sequence-to-vector models
32. Our Model
• Stacked document embeddings
  • Multiple pre-trained and custom-trained (by us) word embeddings, stacked.
  • Enables capturing different relations & features in the document.
  • The embeddings are projected to a trainable embedding layer, enabling fine-tuning for the downstream/explicit task.
• Model
  • LSTM
  • 2 layers
  • 64 hidden units (rolling over time)
  • 256-dimensional embedding re-projection
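A sketch of this architecture in Flair, assuming that library (which the parameter names on the earlier slide suggest); the embedding choices and label set are illustrative, and the TextClassifier signature varies slightly between Flair versions.

```python
from flair.data import Dictionary
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentRNNEmbeddings
from flair.models import TextClassifier

# Multiple pre-trained word embeddings; DocumentRNNEmbeddings stacks them internally.
word_embeddings = [
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
]

# LSTM rolled over the stacked word embeddings: 2 layers, 64 hidden units,
# with the words re-projected to a trainable 256-dimensional layer for fine-tuning.
document_embeddings = DocumentRNNEmbeddings(
    word_embeddings,
    hidden_size=64,
    rnn_layers=2,
    reproject_words=True,
    reproject_words_dimension=256,
    rnn_type="LSTM",
)

# Illustrative label dictionary; the real one would be built from the Stack Overflow corpus.
label_dict = Dictionary(add_unk=False)
for tag in ["devops", "qa", "big-data"]:
    label_dict.add_item(tag)

classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, multi_label=True)
```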
34. Multi Class | Multi Label
Question
• Dev Ops: Jenkins, AWS, Docker
• QA: Testing, Selenium
• Big Data: Hadoop, Apache Spark
• Haskell
35. Re-Designed Model
• Improved average coverage by 17% by just changing to ensembled models.
• Trade-off with inference time & memory utilization.
• Possible expansion to ensembling on each category.
• Business impact:
  • Slower inference
  • Scalable training
  • Scalable to more domains
36. Re-Designed Model
• Ensembled model of categories
• Helped with class imbalance; an alternative to bootstrapping.
• Flexibility to set a threshold per classifier.
• Flexibility to tune hyperparameters differently for each classifier.
• Negative sampling for the different classifiers.
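One way to picture the per-category ensemble with individual thresholds; this is entirely a hypothetical sketch, and the function and method names below are invented rather than taken from the repository.

```python
# Hypothetical ensemble: one multi-label classifier per category (Dev Ops, QA, Big Data, ...),
# each with its own decision threshold and its own hyperparameters / negative sampling.
def ensemble_predict(question, category_models, thresholds):
    """category_models: dict mapping category -> model exposing predict_scores(text) -> {tag: score}.
    thresholds: dict mapping category -> minimum score for a tag to be emitted."""
    predicted_tags = []
    for category, model in category_models.items():
        scores = model.predict_scores(question)  # invented method name
        predicted_tags.extend(
            tag for tag, score in scores.items() if score >= thresholds[category]
        )
    return predicted_tags
```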
37. Take Away
• Recurrent Neural Networks need to be trained cautiously.
• A single deep learning model is not a silver-bullet solution to every problem.
• Ensemble models can be used to sub-divide the problem and reduce complexity.
• Case sensitivity in models and embeddings is an important factor to take into consideration.
• Data augmentation/extrapolation can increase false positives, apart from generalizing the model.