Creating AnswerBot with Keras and TensorFlow (TensorBeat)

Introduction
• Avkash Chauhan, H2O.ai
o Head of enterprise products and customers
o @avkashchauhan | https://www.linkedin.com/in/avkashchauhan
• Products
o H2O
o Sparkling Water
o Deep Water
• NN – TensorFlow, MXNet, Caffe
• GPU
• xgboost - Distributed

What is an AnswerBot?
• An AnswerBot is a standalone intelligent application
• AnswerBot uses machine learning to respond to user input
• Provide relevant knowledge base articles as answers
• Self-service customer base
• Raises awareness of knowledge base offerings
• Generate product feedback silently

AnswerBot – Client Interface

AnswerBot – Result Interface
[Screenshot: result view listing "Possible Answers" with match scores (60%, 42%) and "More.." links]

AnswerBot – Administrator Interface
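[Screenshot: administrator view of a scored question. Fields shown: Question, Tags, Sentiment (Positive / Negative), Priority (Low / Medium / High / Critical), Sex (Male / Female), Ratings, NSFW (3), Top (n) Answers.]
Sample values in the screenshot: 728 35%, 728 27%, 718 17%, 800 13%, 128 3%; 128 20%, 18 20%, 621 20%, 801 20%, 1208 20%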

AnswerBot in production - Teaser
[Architecture diagram: an ML pipeline prototype to get the top N matching answers.
Labels in the diagram: Community, Stackoverflow, Reddit, Quora, Slack, Bot, Support Portal, AWS, API Gateway, AWS Lambda (Question Scoring), S3, DynamoDB, AWS SQS, AWS SNS, Scoring Pipeline, Model Preparation Process, Model Production.]

Problems to solve
• Finding proper tags
• Finding & Removing NSFW words
• Sentiment in the question (positive or negative)
• Priority to find the answer (Low, medium, high, critical)
• Can we figure out if the questioner is male or female?
• Question rating (How well was the question written?)
• Finding the best available answers
• Duplicate Questions

Problems to solve – Solutions (Part 1)
1. Finding proper tags:
   1. Word embeddings
   2. Matching words
2. Finding & Removing NSFW words
   1. Brute-force search
   2. NLTK stop words
3. Sentiment in the question: (Positive or Negative)
   1. Binomial (2 classes) classification
      1. Tree-based algorithms (GBM/RF/DRF) or NN
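
The brute-force search for NSFW words in item 2 can be sketched roughly as below. This is only an illustration, not the code from the talk: the `BLOCKLIST` contents and the `remove_blocked_words` helper are hypothetical placeholders.

```python
import re

# Hypothetical blocklist; a real deployment would load a curated word list.
BLOCKLIST = {"darn", "heck"}

def remove_blocked_words(text, blocklist=BLOCKLIST):
    """Return (cleaned_text, hits): drop blocklisted words and count them."""
    kept, hits = [], 0
    for token in text.split():
        # Compare on a lowercased copy with surrounding punctuation stripped.
        core = re.sub(r"^\W+|\W+$", "", token).lower()
        if core in blocklist:
            hits += 1
        else:
            kept.append(token)
    return " ".join(kept), hits

cleaned, nsfw_count = remove_blocked_words("Why the heck does my model overfit?")
print(cleaned, nsfw_count)
```

The hit count is the kind of number an administrator view could surface next to each question.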

4. Priority to find the answer (Low/Medium/High/Critical)
   1. Multinomial (4 classes) classification
      1. Tree-based algorithms (GBM/RF/DRF) or NN
5. Can we figure out if the questioner is male or female?
   1. Binomial (2 classes) classification
      1. Tree-based algorithms (GBM/RF/DRF) or NN
6. Question rating (How well was the question written?)
   1. Multinomial (N classes – 1-5 stars) classification
      1. Tree-based algorithms or NN

7. Finding the best available answers
   1. Looking for the tags and keywords – Clustering / Reduction
   2. Creating tag & keyword weights for each question
   3. Matching tags, keywords and their weights to find the top probabilities
8. Duplicate questions
   1. Quora has the same problem to solve on Kaggle
      1. https://www.kaggle.com/c/quora-question-pairs/data
      2. https://www.kaggle.com/anokas/data-analysis-xgboost-starter-0-35460-lb
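
One way to picture the tag-weight matching for finding the best answers is the sketch below. It assumes each question and each knowledge-base article already has a dictionary of tag weights; the scoring rule (sum of weight products, normalized to shares), the function names, and the sample data are all illustrative assumptions rather than the method used in the talk.

```python
def score_answer(question_weights, answer_weights):
    """Sum the products of weights for tags the question and answer share."""
    shared = set(question_weights) & set(answer_weights)
    return sum(question_weights[t] * answer_weights[t] for t in shared)

def top_n_answers(question_weights, answers, n=3):
    """Rank answers by score; return (answer_id, share_of_total_score) pairs."""
    scores = {aid: score_answer(question_weights, w) for aid, w in answers.items()}
    total = sum(scores.values())
    if total == 0:
        return []  # nothing in the knowledge base matches this question
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(aid, score / total) for aid, score in ranked[:n] if score > 0]

# Hypothetical knowledge-base articles keyed by ID, each with tag weights.
answers = {
    728: {"keras": 0.9, "gpu": 0.4},
    718: {"tensorflow": 0.8, "gpu": 0.6},
    800: {"xgboost": 1.0},
}
question = {"keras": 1.0, "gpu": 0.5}
print(top_n_answers(question, answers, n=2))
```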

Data Preparation
• Real Data
o Real Question/Answers
• StackOverflow, Community, Quora, Support System
• Experimental Data
o Yelp – 4.1M reviews in the 1-5 star category - Supervised
• Ratings: 1-5
o Twitter Sentiment – Search it OR Mine It - Supervised
• Positive/Negative
• Male/Female

Our Experimentation Today
• Classifying sentences to predict
o Ratings: Stars (1-5)
• Multinomial classification example
o Sentiments: Positive or Negative
• Binomial classification example

Demo
• Binomial & Multinomial Classification
$ python PredictNow.py

Why Keras?
• High-level Python API that runs on top of TensorFlow and Theano
• Great for quick experimentation
• Supports both CNNs and RNNs, and combinations of the two
• Runs on CPU and GPU
• Visit: https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html

Word2vec
• Word2vec is a neural-network-based word embedding method.
• A neural network with only 1 linear hidden layer
o The hidden layer is used to transform inputs into something that the output layer can use.
o Each hidden unit has a linear activation
• Represents words in a continuous, low-dimensional vector space (i.e., the embedding space)
o Semantically similar words are mapped to nearby points.
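
To make "nearby points" concrete, closeness in the embedding space is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings have many more dimensions and come from training or a pre-trained file.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, chosen so that "good" and "great" point in similar directions.
embeddings = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "slow": [-0.1, 0.9, 0.3],
}

print(cosine_similarity(embeddings["good"], embeddings["great"]))
print(cosine_similarity(embeddings["good"], embeddings["slow"]))
```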

Understanding Dataset
• Ratings Analysis
o review,stars
o The food is WAAAAY overpriced and totally not worth it, they charged for the salsa and the service was ridiculously slow....The guacamole was good though., 2
o Decent food at a great price. Unfortunately, the place is so jam packed it's almost an inconvenience to head back to the buffet lines., 2
o Love getting my haircut here! It's only $25 for a women's haircut. I'm pretty picky about how much my hair is layered and I've never had a problem here. Make sure to call in to schedule your appointment ahead of time during the school year because she's usually booked two days in advance., 5
• Sentiment Analysis
o Text, Sentiment
o I lost $80 today I know I shouldn't put things in my back pocket but I was about to put in my bag when I realized it was gone., 0
o Just got back from Seattle. Lots of crowds. Nordstrom was nuts. But Taphouse Grill was practically empty. Found hardcover of Mad Love!, 1
o Crunch week! This Friday, I'll be heading to Oddmall, my first major craft fair, in Hudson, Ohio! I'm tricking out the website., 1
o Another beautiful day out today!! Going to build some models first then go for running!!, 1
o Tired. Just tired. Home time!! I'm weaksauce, I know , 0

Components & Experimentation
• Keras
• Tensorflow
o GPU
• NLTK
o Using Stop Words
• GloVe
o Pre-trained word vectors
o Small (400K words)
• Python
• Jupyter notebook

Experimentation – Part 1
1. Data Preparation
2. Creating word collection
1. Removing stop words
2. Collecting all words into a big list
3. Tokenization and uniform data collection
1. Using full words collection
2. Get unique words in our collection
3. Tokenize at sentence level
4. Final Dataset
1. Sentences [sentences_per_record, length] - X
2. Labels [label_per_record, length] – Y
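
The steps above can be sketched in plain Python as follows. The tiny `STOP_WORDS` set stands in for NLTK's stop-word list, and the helper names, the padding value 0, and padding at the end of each row are illustrative assumptions, not the notebook's actual code.

```python
# Tiny illustrative stop-word list; the talk uses NLTK's stop words instead.
STOP_WORDS = {"the", "is", "a", "and", "was", "it"}

def clean(sentence):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,!?'\"").lower() for w in sentence.split())
    return [w for w in words if w and w not in STOP_WORDS]

def build_word_index(sentences):
    """Map each unique word to an integer ID; 0 is reserved for padding."""
    index = {}
    for sentence in sentences:
        for word in clean(sentence):
            index.setdefault(word, len(index) + 1)
    return index

def to_padded_sequences(sentences, word_index, max_len):
    """Turn sentences into fixed-length lists of word IDs (the X matrix)."""
    rows = []
    for sentence in sentences:
        ids = [word_index[w] for w in clean(sentence) if w in word_index][:max_len]
        rows.append(ids + [0] * (max_len - len(ids)))  # pad short rows with 0
    return rows

reviews = ["The food was great!", "Service is slow and the food was cold."]
stars = [5, 2]

word_index = build_word_index(reviews)
X = to_padded_sequences(reviews, word_index, max_len=4)
Y = stars
print(word_index)
print(X, Y)
```

Every row of X has the same length, which is what lets the sentences be fed to a network as one uniform array.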

Experimentation – Part 2
4. Splitting dataset to training and validation
5. Creating Embedding Matrix
o Loading predefined word vector
o Finding matching words from our collection and creating the embedding matrix
6. Creating Embedding Layer/Configuration
7. Training
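
Steps 4 and 5 can be sketched without any framework, as below. The parser assumes the plain-text layout used by GloVe files (a word followed by its vector components on each line); the function names, the 2-dimensional sample vectors, and the 20% validation fraction are assumptions made for illustration.

```python
import random

def load_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a {word: vector} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) > 1:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word with ID i; unmatched words stay zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        if word in vectors:
            matrix[i] = list(vectors[word])
    return matrix

def split_train_validation(X, Y, validation_fraction=0.2, seed=0):
    """Shuffle record indices once, then carve off the validation share."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    n_val = int(len(X) * validation_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return ([X[i] for i in train_idx], [Y[i] for i in train_idx],
            [X[i] for i in val_idx], [Y[i] for i in val_idx])

# Stand-in for lines read from a pre-trained vector file (2 dimensions here).
sample_lines = ["food 0.1 0.2", "great 0.3 0.4"]
word_index = {"food": 1, "great": 2, "weaksauce": 3}
matrix = build_embedding_matrix(word_index, load_vectors(sample_lines), dim=2)
print(matrix)
```

A matrix of this shape (vocabulary size + 1 rows, one column per vector dimension) is what would then initialize the embedding layer in step 6.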

8. Understanding results
o Layers connection
o Model configuration
o Model weights
9. Saving model configuration, weights, data-model
o HDF5 is a data model, library, and file format for
storing and managing data

10. Model Metrics and Performance
o Getting Model Metrics
o Model Performance Graph
o Model Accuracy
• Training
• Validation
11. Prediction
o Validation Data
o User Input

What if you get the exact same prediction every time?
• Bad model: it could simply be a bad model. Retrain it.
• Rebalance your dataset:
o Either upsample the less frequent class
o Or downsample the more frequent one.
• Adjust class weights: with a higher class weight for the less frequent class, the network pays more attention to that class during training.
• Increase the training time: after training for longer, the network starts to concentrate more on the less frequent classes.
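
As a rough sketch of the class-weight adjustment, one common heuristic sets each class weight to n_samples / (n_classes * class_count), so rarer classes get larger weights. The helper name and the sample labels below are made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Imbalanced toy labels: six positives (1) and two negatives (0).
labels = [1, 1, 1, 1, 1, 1, 0, 0]
weights = balanced_class_weights(labels)
print(weights)
```

A dictionary mapping class index to weight, like this one, is the form Keras accepts through the `class_weight` argument of `model.fit`.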

Advanced Processing
• Engine:
o Doc2vec - https://radimrehurek.com/gensim/models/doc2vec.html
o Seq2seq - https://github.com/farizrahman4u/seq2seq
o Lda2vec - http://multithreaded.stitchfix.com/blog/2016/05/27/lda2vec/
o RNN & LSTM - https://arxiv.org/pdf/1502.06922.pdf
• Training
o CPU vs GPU
o Checkpoints with training

AnswerBot production pipeline in cloud (AWS)
[Architecture diagram: an ML pipeline prototype to get the top N matching answers.
Labels in the diagram: Community, Stackoverflow, Reddit, Quora, Slack, Bot, Support Portal, AWS, API Gateway, AWS Lambda (Question Scoring), S3, DynamoDB, AWS SQS, AWS SNS, Scoring Pipeline, Model Preparation Process, Model Production.]
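
For the "AWS Lambda (Question Scoring)" box behind API Gateway, a handler could look roughly like the sketch below. It assumes the API Gateway Lambda proxy format (request body as a JSON string in `event["body"]`, response with `statusCode` and a string `body`); `score_question` is a hypothetical stand-in for the real model call and returns fixed placeholder values.

```python
import json

def score_question(question):
    """Hypothetical placeholder for the model call; returns (answer_id, score) pairs."""
    return [(728, 0.35), (718, 0.17)]

def lambda_handler(event, context):
    """Entry point in the shape AWS Lambda uses for an API Gateway proxy request."""
    body = json.loads(event.get("body") or "{}")
    question = body.get("question", "").strip()
    if not question:
        return {"statusCode": 400,
                "body": json.dumps({"error": "question is required"})}
    answers = [{"id": aid, "score": score} for aid, score in score_question(question)]
    return {"statusCode": 200, "body": json.dumps({"answers": answers})}

# Local smoke test with a hand-built event; no AWS account is needed for this.
event = {"body": json.dumps({"question": "How do I train on a GPU?"})}
print(lambda_handler(event, None))
```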

Content
• Github - https://github.com/Avkash/mldl/tree/master/tensorbeat-answerbot
• Dataset
o Sentiment : Search it or Mine it
o 5Star - https://www.yelp.com/dataset_challenge/dataset
• Python/Jupyter Notebook
o Sentiment:
• make-sentiment-model.py
• PositiveNegative.ipynb
o 5Star:
• make-5star-model.py
• 5StarReviews.ipynb
o Prediction – PredictNow.py
