With recent advances in the ability of neural networks to process text and audio data, we are getting very close to creating a natural human assistant. TensorFlow from Google is one of the most popular neural network libraries, and Keras simplifies its usage. TensorFlow brings powerful capabilities to natural language processing (NLP), and with deep learning we expect bots to become even smarter and closer to the human experience. In this technical discussion, we will explore NLP methods in TensorFlow with Keras to create an answer bot that is ready to answer specific technical questions. You will learn how to use TensorFlow to train an answer bot on specific technical questions and how to use various AWS services to deploy the answer bot in the cloud.
2. Introduction
• Avkash Chauhan, H2O.ai
o Head of enterprise products and customers
o @avkashchauhan | https://www.linkedin.com/in/avkashchauhan
• Products
o H2O
o Sparkling Water
o Deep Water
• NN – Tensorflow, mxnet, Caffe
• GPU
• xgboost - Distributed
3. What is an AnswerBot?
• An AnswerBot is a standalone intelligent application
• AnswerBot uses machine learning to respond to user input
• Provide relevant knowledge base articles as answers
• Self-service customer base
• Raises awareness of knowledge base offerings
• Generate product feedback silently
8. Problems to solve
• Finding proper tags
• Finding & Removing NSFW words
• Sentiment in the question (positive or negative)
• Priority to find the answer (Low, medium, high, critical)
• Can we figure out if questioner is male or female?
• Question rating (How the question was written?)
• Finding the best available answers
• Duplicate Questions
9. Problems to solve – Solutions (Part 1)
1. Finding proper tags:
1. Word embeddings
2. Matching words
2. Finding & Removing NSFW words
1. Brute Force Search
2. NLTK Stop Words
3. Sentiment in the question: (Positive or Negative)
1. Binomial (2 classes) classification
1. Tree Based Algorithms (GBM/RF/DRF) or NN
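The brute-force search for unwanted words could look roughly like the sketch below. It is an illustration only, not the code from the talk: the `BLOCKLIST` contents and the `remove_blocked_words` helper are hypothetical, and a real system would load a curated word list instead.

```python
import string

# Hypothetical blocklist; a real deployment would load a curated word list.
BLOCKLIST = {"darn", "heck"}

def remove_blocked_words(text, blocklist=BLOCKLIST):
    """Brute-force filter: check every token against the blocklist."""
    kept = []
    for token in text.split():
        # Compare on a normalized form so "Darn!" still matches "darn".
        normalized = token.strip(string.punctuation).lower()
        if normalized not in blocklist:
            kept.append(token)
    return " ".join(kept)

cleaned = remove_blocked_words("Why does this darn model not converge?")
print(cleaned)
```

Using a set for the blocklist keeps each lookup constant-time, so the scan stays linear in the length of the question.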
10. Problems to solve – Solutions (Part 2)
1. Priority to find the answer (Low/Medium/High/Critical)
1. Multinomial (4 classes) classification
1. Tree based algorithms (GBM/RF/DRF) or NN
2. Can we figure out if the questioner is male or female?
1. Binomial (2 classes) classification
1. Tree based algorithms (GBM/RF/DRF) or NN
3. Question rating (How the question was written?)
1. Multinomial (N class – 1-5 star) classification
1. Tree Based algorithms or NN
11. Problems to solve – Solutions (Part 3)
1. Finding the best available answers
1. Looking for the tags and keywords – Clustering / Reduction
2. Creating tag & keyword weights for each question
3. Matching tags, keywords, and their weights to find the top probabilities
2. Duplicate Questions
1. Quora has the same problem, posed as a Kaggle competition
1. https://www.kaggle.com/c/quora-question-pairs/data
2. https://www.kaggle.com/anokas/data-analysis-xgboost-starter-0-35460-lb
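One possible reading of the "tag and keyword weights" idea for finding the best answers is sketched below. The slides do not say how the weights are computed, so the inverse-document-frequency weighting here is an assumption, the scores are plain sums rather than probabilities, and `KNOWLEDGE_BASE`, `keyword_weights`, and `top_answers` are hypothetical names.

```python
import math
from collections import Counter

# Hypothetical knowledge base: answer id -> tags and keywords.
KNOWLEDGE_BASE = {
    "kb-1": ["tensorflow", "gpu", "install"],
    "kb-2": ["keras", "embedding", "layer"],
    "kb-3": ["tensorflow", "keras", "save", "model"],
}

def keyword_weights(kb):
    """Weight each keyword by inverse document frequency: rarer means heavier."""
    doc_freq = Counter(word for words in kb.values() for word in set(words))
    total = len(kb)
    return {word: math.log(1 + total / count) for word, count in doc_freq.items()}

def top_answers(question_keywords, kb, n=2):
    """Score each answer by the summed weight of shared keywords; keep the top n."""
    weights = keyword_weights(kb)
    asked = set(question_keywords)
    scores = {
        answer_id: sum(weights[w] for w in asked & set(words))
        for answer_id, words in kb.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(answer_id, score) for answer_id, score in ranked[:n] if score > 0]

best = top_answers(["keras", "save", "model"], KNOWLEDGE_BASE)
print(best)
```

Here "kb-3" ranks first because it shares three keywords with the question, two of which are rare in the knowledge base.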
12. Data Preparation
• Real Data
o Real Question/Answers
• StackOverflow, Community, Quora, Support System
• Experimental Data
o Yelp – 41M reviews in 1-5 stars category - Supervised
• Ratings: 1-5
o Twitter Sentiment – Search it OR Mine It - Supervised
• Positive/Negative
• Male/Female
13. Our Experimentation Today
• Classifying sentences to predict
o Ratings: Stars (1-5)
• Multinomial classification example
o Sentiments: Positive or Negative
• Binomial classification example
15. Why Keras?
• High-level Python API that runs on top of TensorFlow & Theano
• Great for quick and fast experimentation
• Supports both CNNs and RNNs, and combinations of the two
• Runs on CPU & GPU
• Visit: https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html
16. Word2vec
• Word2vec is a neural-network-based word embedding method.
• A neural network with only one linear hidden layer
o The hidden layer transforms inputs into something that the output layer can use.
o Each hidden unit has a linear activation
• Represents words in a continuous, low-dimensional vector space (i.e., the embedding space)
o Semantically similar words are mapped to nearby points.
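To make "nearby points" concrete, the sketch below measures closeness with cosine similarity. The three-dimensional vectors are invented for illustration (real embeddings are learned from data and have many more dimensions), and `EMBEDDINGS`, `cosine_similarity`, and `nearest` are hypothetical names.

```python
import math

# Toy vectors invented for illustration; real embeddings are learned from text.
EMBEDDINGS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "slow": [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word, embeddings):
    """Return the other word whose vector is most similar to the given word's."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine_similarity(embeddings[word], embeddings[w]))

print(nearest("good", EMBEDDINGS))
```

With these toy values, "great" lies much closer to "good" than "slow" does, which is the behavior a trained embedding is expected to show for semantically similar words.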
17. Understanding Dataset
• Ratings Analysis
o review,stars
o The food is WAAAAY overpriced and totally not worth it, they charged for the salsa and the service was ridiculously slow....The guacamole was good though., 2
o Decent food at a great price. Unfortunately, the place is so jam packed it's almost an inconvenience to head back to the buffet lines., 2
o Love getting my haircut here! It's only $25 for a women's haircut. I'm pretty picky about how much my hair is layered and I've never had a problem here. Make sure to call in to schedule your appointment ahead of time during the school year because she's usually booked two days in advance., 5
• Sentiment Analysis
o Text, Sentiment
o I lost $80 today I know I shouldn't put things in my back pocket but I was about to put in my bag when I realized it was gone., 0
o Just got back from Seattle. Lots of crowds. Nordstrom was nuts. But Taphouse Grill was practically empty. Found hardcover of Mad Love!, 1
o Crunch week! This Friday, I'll be heading to Oddmall, my first major craft fair, in Hudson, Ohio! I'm tricking out the website., 1
o Another beautiful day out today!! Going to build some models first then go for running!!, 1
o Tired. Just tired. Home time!! I'm weaksauce, I know , 0
18. Components & Experimentation
• Keras
• Tensorflow
o GPU
• NLTK
o Using Stop Words
• GloVe
o Pre-trained word vectors
o Small (400K words)
• Python
• Jupyter notebook
19. Experimentation – Part 1
1. Data Preparation
2. Creating word collection
1. Removing stop words
2. Collecting all words into a big list
3. Tokenization and uniform data collection
1. Using full words collection
2. Get unique words in our collection
3. Tokenize at sentence level
4. Final Dataset
1. Sentences [sentences_per_record, length] - X
2. Labels [label_per_record, length] – Y
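The data preparation steps above can be sketched in plain Python as follows. This is a simplified stand-in rather than the notebook code: `STOP_WORDS` is a tiny hypothetical substitute for NLTK's stop-word list, the helper names are made up, and the two reviews and their star labels are invented.

```python
import string

# Tiny stand-in for NLTK's English stop-word list (assumption for illustration).
STOP_WORDS = {"the", "is", "a", "and", "was", "it", "at"}

def clean(sentence):
    """Lowercase, strip punctuation, and drop stop words."""
    words = (w.strip(string.punctuation).lower() for w in sentence.split())
    return [w for w in words if w and w not in STOP_WORDS]

def build_word_index(sentences):
    """Map each unique word to an integer id, reserving 0 for padding."""
    index = {}
    for sentence in sentences:
        for word in clean(sentence):
            index.setdefault(word, len(index) + 1)
    return index

def to_padded_sequences(sentences, word_index, max_len):
    """Turn sentences into fixed-length id lists, left-padded with zeros."""
    rows = []
    for sentence in sentences:
        ids = [word_index[w] for w in clean(sentence) if w in word_index][:max_len]
        rows.append([0] * (max_len - len(ids)) + ids)
    return rows

def one_hot(labels, num_classes):
    """Encode integer class ids as one-hot rows."""
    return [[1 if i == label else 0 for i in range(num_classes)] for label in labels]

reviews = ["The food was good.", "Service was slow and the food overpriced."]
stars = [4, 2]  # hypothetical 1-5 star labels

word_index = build_word_index(reviews)
X = to_padded_sequences(reviews, word_index, max_len=5)
Y = one_hot([s - 1 for s in stars], num_classes=5)
print(X, Y)
```

Every row of `X` has the same length and every row of `Y` has one entry per class, which gives the uniform shapes a network needs as input and target.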
20. Experimentation – Part 2
4. Splitting dataset to training and validation
5. Creating Embedding Matrix
o Loading pre-trained word vectors
o Finding matching words from our collection and creating the embedding matrix
6. Creating Embedding Layer/Configuration
7. Training
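The split and the embedding matrix could be built along these lines. This is a sketch under stated assumptions: `GLOVE_TEXT` is an in-memory stand-in for a GloVe text file with invented values, the helper names are hypothetical, and plain lists are used where real code would more likely use NumPy arrays.

```python
import io
import random

# Stand-in for a GloVe text file: a word followed by its vector on each line.
# The values are invented; real files hold many more floats per line.
GLOVE_TEXT = """food 0.1 0.2 0.3
good 0.4 0.5 0.6
slow 0.7 0.8 0.9
"""

def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a word -> vector dictionary."""
    vectors = {}
    for line in handle:
        parts = line.split()
        if parts:
            vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with id i; unmatched rows stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = vectors[word]
    return matrix

def train_validation_split(rows, labels, validation_fraction=0.2, seed=42):
    """Shuffle indices reproducibly, then cut them into training and validation."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    cut = int(len(order) * (1 - validation_fraction))
    train, val = order[:cut], order[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in val], [labels[i] for i in val])

word_index = {"food": 1, "good": 2, "overpriced": 3}
vectors = load_vectors(io.StringIO(GLOVE_TEXT))
embedding_matrix = build_embedding_matrix(word_index, vectors, dim=3)
print(embedding_matrix)
```

In a Keras workflow, a matrix like this would typically supply the initial weights of an `Embedding` layer; words missing from the pre-trained vectors (here "overpriced") simply keep an all-zero row.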
21. Experimentation – Part 3
8. Understanding results
o Layers connection
o Model configuration
o Model weights
9. Saving model configuration, weights, data-model
o HDF5 is a data model, library, and file format for storing and managing data
22. Experimentation – Part 4
10. Model Metrics and Performance
o Getting Model Metrics
o Model Performance Graph
o Model Accuracy
• Training
• Validation
11. Prediction
o Validation Data
o User Input
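As a rough illustration of the prediction and accuracy steps, the sketch below turns a vector of class probabilities into a label and compares predictions with known labels. The probabilities are invented rather than taken from a trained model, and `STAR_LABELS`, `predict_label`, and `accuracy` are hypothetical names.

```python
STAR_LABELS = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]

def predict_label(probabilities, class_names):
    """Pick the class with the highest predicted probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return class_names[best], probabilities[best]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Invented output for one review; a real model would produce this vector.
label, confidence = predict_label([0.05, 0.10, 0.15, 0.50, 0.20], STAR_LABELS)
print(label, confidence)
```

Computing `accuracy` separately on training and validation data shows whether the model generalizes or only memorizes the training set.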
23. What if you get the exact same prediction every time?
• Bad model: the model itself could be bad. Retrain it.
• Rebalance your dataset:
o Either upsample the less frequent class
o Or downsample the more frequent one.
• Adjust class weights: with a higher class weight set for the less frequent class, the network pays more attention to that class during training
• Increase the training time: after training for a long time, the network starts concentrating more on the less frequent classes.
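Two of these remedies, class weights and upsampling, can be sketched as below. The weighting formula, total / (number of classes × class count), is one common convention and an assumption here, since the slides do not give one; `balanced_class_weights` and `upsample` are hypothetical helpers.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by total / (num_classes * count): rare classes weigh more."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

def upsample(rows, labels, seed=0):
    """Repeat randomly chosen minority examples until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]  # 8 positive, 2 negative
weights = balanced_class_weights(labels)
print(weights)
```

A dictionary of this shape, mapping class id to weight, is the form Keras accepts through the `class_weight` argument of `fit`.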
24. Advanced Processing
• Engine:
o Doc2vec - https://radimrehurek.com/gensim/models/doc2vec.html
o Seq2seq - https://github.com/farizrahman4u/seq2seq
o Lda2vec - http://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/
o RNN & LSTM - https://arxiv.org/pdf/1502.06922.pdf
• Training
o CPU vs GPU
o Checkpoints with training
25. AnswerBot production pipeline in cloud (AWS)
• Diagram: an ML pipeline prototype to get the top N matching answers
• Labels in the diagram: Community, StackOverflow, Reddit, Quora, Slack, Bot, Support Portal
• AWS components: API Gateway, AWS Lambda (question scoring), S3, DynamoDB, AWS SQS, AWS SNS
• Stages: Scoring Pipeline, Model Preparation Process, Model Production
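The question-scoring step behind API Gateway might be shaped like the Lambda handler sketched below. This is an assumption about how the pieces fit, not the deployed code: `score_question` is a hypothetical stub standing in for the trained model, and the request and response field names (`question`, `answers`) are invented.

```python
import json

def score_question(question):
    """Hypothetical stand-in for the model call; returns (answer_id, score) pairs."""
    return [("kb-42", 0.91)] if "keras" in question.lower() else []

def error(status, message):
    """Build an error response in the API Gateway proxy format."""
    return {"statusCode": status, "body": json.dumps({"error": message})}

def lambda_handler(event, context):
    """Entry point in the shape AWS Lambda uses for an API Gateway proxy event."""
    try:
        body = json.loads(event.get("body") or "{}")
    except json.JSONDecodeError:
        return error(400, "invalid JSON")
    if not isinstance(body, dict):
        return error(400, "expected a JSON object")
    question = str(body.get("question", "")).strip()
    if not question:
        return error(400, "missing question")
    answers = [{"id": a, "score": s} for a, s in score_question(question)]
    return {"statusCode": 200, "body": json.dumps({"answers": answers})}

request = {"body": json.dumps({"question": "How do I save a Keras model?"})}
response = lambda_handler(request, None)
print(response)
```

Keeping the handler free of model-loading logic makes it easy to test locally with a plain dictionary, as in the last lines, before wiring it to S3 or DynamoDB.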
26. Content
• Github - https://github.com/Avkash/mldl/tree/master/tensorbeat-answerbot
• Dataset
o Sentiment : Search it or Mine it
o 5Star - https://www.yelp.com/dataset_challenge/dataset
• Python/Jupyter Notebook
o Sentiment:
• make-sentiment-model.py
• PositiveNegative.ipynb
o 5Star:
• make-5star-model.py
• 5StarReviews.ipynb
o Prediction – PredictNow.py