2. About me
● Chief Data Scientist @ Boost AI
● Machine learning enthusiast
● Kaggle junkie (highest world rank #3)
● Interested in:
○ Automatic machine learning
○ Large scale classification of text data
○ Chatbots
I like big data and I cannot lie
3. Agenda
● Brief introduction to deep learning
● Implementation of deepnets
● Fine-tuning of pre-trained networks
● 4 different industrial use cases
● No maths!!!!
32. How can I implement my own DeepNets?
● Implement them on your own
○ Decompose into smaller parts
○ Implement layers
○ Start training
● Save yourself some time and fine-tune
○ Convert data
○ Define net
○ Define solver
○ Train
● Caffe (caffe.berkeleyvision.org)
● Keras (www.keras.io)
43. What do you need for Caffe?
● Convert data
● Define a network (prototxt)
● Define a solver (prototxt)
● Train the network (with or without pre-trained weights)
48. Training a net using Caffe
/PATH_TO_CAFFE/caffe train --solver=solver.prototxt
49. Fine Tuning!
● Fine-tuning using GoogLeNet
● Why?
○ It has Google in its name
○ It won ILSVRC 2014
○ It’s complicated and I wanted to play with it
● The Caffe Model Zoo offers a lot of pre-trained nets, including GoogLeNet
● Model Zoo: https://github.com/BVLC/caffe/wiki/Model-Zoo
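Fine-tuning in Caffe uses the same train command with pre-trained weights passed via --weights; a minimal sketch, assuming the Model Zoo's GoogLeNet weights have been downloaded as bvlc_googlenet.caffemodel:
/PATH_TO_CAFFE/caffe train --solver=solver.prototxt --weights=bvlc_googlenet.caffemodel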
88. Why classify search queries?
● For businesses
○ Find out user intent
○ Track keywords according to the user’s transactional buying cycle
○ Optimize website content and focus on a smaller keyword set
● For data scientists
○ 100s of millions of unlabeled keywords to play with
○ Why not!
101. Representing Queries as Images
[Figure: queries such as “David Villa”, “Apple juice” and “Irish” rendered as images built from Word2Vec representations of the top search result titles]
102. I don’t see much difference!
[Figure: the generated images for “Guild Wars” and “Apple juice” look nearly identical]
104. Machine Learning Models
● Boosted trees
○ Word2vec embeddings
○ Titles from top results
○ Additional features of the SERP (search engine results page)
○ TF-IDF
○ XGBoost!!!! (https://github.com/dmlc/xgboost)
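A tiny hedged sketch of the boosted-tree side with xgboost; the hyperparameters and feature matrices are illustrative, not the exact setup from the talk:

import xgboost as xgb

clf = xgb.XGBClassifier(max_depth=7, n_estimators=200, learning_rate=0.1)
clf.fit(X_train, y_train)  # X_train: stacked word2vec / TF-IDF / SERP features
preds = clf.predict_proba(X_test)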
105. Machine Learning Models
● Convolutional Neural Networks:
○ Using images directly
○ Using random crops from the image
108. Neural Networks with Keras
[Diagram: convolutional neural network]
https://github.com/fchollet/keras
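A minimal Keras sketch (Keras 2 layer names; not the exact architecture from the talk) of a small convnet over such query images; the 64x64 single-channel input and the three intent classes are illustrative assumptions:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

num_classes = 3  # illustrative, e.g. informational / navigational / transactional

model = Sequential()
# two convolution + pooling stages over a single-channel query image
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# flatten and classify
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])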
111. Approaching “any” ML problem
AutoCompete: A Framework for Machine Learning Competitions, A. Thakur and A. Krohn-Grimberghe, ICML AutoML Workshop, 2015
125. Selecting NNet Architecture
● Always use SGD or Adam (for fast convergence)
● Start small:
○ Single layer with 120-500 neurons
○ Batch normalization + ReLU
○ Dropout: 10-20%
● Add a new layer:
○ 1200-1500 neurons
○ High dropout: 40-50%
● Very big network:
○ 8000-10000 neurons in each layer
○ 60-80% dropout
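A hedged Keras sketch of the starting recipe above; input_dim and num_classes are placeholders, and the layer sizes follow the ranges on the slide:

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, BatchNormalization

model = Sequential()
model.add(Dense(256, input_dim=input_dim))  # start small: 120-500 neurons
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.2))                     # dropout: 10-20%
model.add(Dense(1200))                      # new layer: 1200-1500 neurons
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.5))                     # high dropout: 40-50%
model.add(Dense(num_classes, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')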
131. What are clickbaits?
● 10 things Apple didn’t tell you about the new iPhone
● What happened next will surprise you
● This is what the actor/actress from the 90s looks like now
● What did Donald Trump just say about Obama and Clinton
● 9 things you must have to be a good data scientist
133. What are clickbaits?
● Interesting titles
● Frustrating titles
● Content is seldom good enough
● Google penalizes clickbait content
● Facebook does the same
134. The data
● Crawl BuzzFeed and ClickHole
● Crawl The New York Times and CNN
● ~10,000 titles
○ Clickbaits: BuzzFeed, ClickHole
○ Non-clickbaits: The New York Times, CNN
○ ~5,000 from each category
135. Good old TF-IDF
● Very powerful
● Used both character and word analyzers
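A minimal scikit-learn sketch of combining word- and character-level TF-IDF; the n-gram ranges are illustrative, not the exact settings used:

from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = FeatureUnion([
    ('word', TfidfVectorizer(analyzer='word', ngram_range=(1, 2))),
    ('char', TfidfVectorizer(analyzer='char', ngram_range=(2, 5))),
])
X = vectorizer.fit_transform(titles)  # titles: a list of headline strings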
141. Is that it?
● No!
● Model predictions:
○ “Donald Trump” : 15% Clickbait
○ “Barack Obama”: 80% Clickbait
● Something was very wrong!
● TF-IDF didn’t capture the meaning of the titles
142. Word2Vec
● Shallow neural networks
● Generates a high-dimensional vector for every word
● Every word gets a position in space
● Similar words cluster together
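A brief gensim sketch, assuming the pre-trained Google News vectors are available locally:

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
# similar words cluster together in the vector space
print(model.most_similar('headline', topn=5))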
147. Does word2vec capture everything?
Do we have all we need only from titles?
What if the content of the website isn’t clickbait-y?
148. The data
● Crawl BuzzFeed, NYT, CNN, ClickHole, etc.
● Too much work
● Simple models
● Doubts about results
● Crawl public Facebook pages:
○ Buzzfeed
○ CNN
○ The New York Times
○ Clickhole
○ StopClickBaitOfficial
○ Upworthy
○ Wikinews
A Facebook page scraper is available here:
https://github.com/minimaxir/facebook-page-post-scraper
149. The data
● link_name (the title of the URL shared)
● status_type (whether it’s a link, photo or a video)
● status_link (the actual URL)
152. Feature Generation
● Size of the HTML (in bytes)
● Length of HTML
● Total number of links
● Total number of buttons
● Total number of inputs
● Total number of unordered lists
● Total number of ordered lists
● Total number of lists (ordered + unordered)
● Total number of H1 tags
● Total number of H2 tags
● Full length of all text in all H1 tags that were found
● Full length of all text in all H2 tags that were found
● Total number of images
● Total number of HTML tags
● Number of unique HTML tags
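A hedged BeautifulSoup sketch of this kind of feature generation; the function and feature names are illustrative, and only part of the list above is shown:

from bs4 import BeautifulSoup

def html_features(html):
    soup = BeautifulSoup(html, 'html.parser')
    tags = [t.name for t in soup.find_all(True)]  # every tag in the page
    return {
        'html_bytes': len(html.encode('utf-8')),
        'num_links': len(soup.find_all('a')),
        'num_buttons': len(soup.find_all('button')),
        'num_inputs': len(soup.find_all('input')),
        'num_ul': len(soup.find_all('ul')),
        'num_ol': len(soup.find_all('ol')),
        'num_h1': len(soup.find_all('h1')),
        'h1_text_len': sum(len(h.get_text()) for h in soup.find_all('h1')),
        'num_images': len(soup.find_all('img')),
        'num_tags': len(tags),
        'num_unique_tags': len(set(tags)),
    }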
167. The Problem
➢ ~ 13 million questions (as of March, 2017)
➢ Many duplicate questions
➢ Cluster and join duplicates together
➢ Remove clutter
➢ First public data release: January 24, 2017
168. Duplicate Questions
➢ How does Quora quickly mark questions as needing improvement?
➢ Why does Quora mark my questions as needing improvement/clarification before I have time to give it details? Literally within seconds…
➢ What practical applications might evolve from the discovery of the Higgs Boson?
➢ What are some practical benefits of discovery of the Higgs Boson?
➢ Why did Trump win the Presidency?
➢ How did Donald Trump win the 2016 Presidential Election?
169. Non-Duplicate Questions
➢ Who should I address my cover letter to if I'm applying for a big company like Mozilla?
➢ Which car is better from safety view? “swift or grand i10”. My first priority is safety?
➢ Mr. Robot (TV series): Is Mr. Robot a good representation of real-life hacking and hacking culture? Is the depiction of hacker societies realistic?
➢ What mistakes are made when depicting hacking in “Mr. Robot” compared to real-life cybersecurity breaches or just a regular use of technologies?
➢ How can I start an online shopping (e-commerce) website?
➢ Which web technology is best suitable for building a big E-Commerce website?
170. The Data
➢ 400,000+ pairs of questions
➢ Initially data was very skewed
➢ Negative samples from related questions
➢ Not real distribution on Quora’s website
➢ Noise exists (as usual)
https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
171. The Data
➢ 255,045 negative samples (non-duplicates)
➢ 149,306 positive samples (duplicates)
➢ ~37% positive samples
172. The Data
➢ Average number of characters in question1: 59.57
➢ Minimum number of characters in question1: 1
➢ Maximum number of characters in question1: 623
➢ Average number of characters in question2: 60.14
➢ Minimum number of characters in question2: 1
➢ Maximum number of characters in question2: 1169
173. Basic Feature Engineering
➢ Length of question1
➢ Length of question2
➢ Difference in the two lengths
➢ Character length of question1 without spaces
➢ Character length of question2 without spaces
➢ Number of words in question1
➢ Number of words in question2
➢ Number of common words in question1 and question2
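A minimal pandas sketch of these features, assuming a DataFrame df with string columns question1 and question2:

df['len_q1'] = df.question1.str.len()
df['len_q2'] = df.question2.str.len()
df['diff_len'] = df.len_q1 - df.len_q2
df['len_char_q1'] = df.question1.str.replace(' ', '').str.len()
df['len_char_q2'] = df.question2.str.replace(' ', '').str.len()
df['len_word_q1'] = df.question1.str.split().str.len()
df['len_word_q2'] = df.question2.str.split().str.len()
df['common_words'] = df.apply(
    lambda r: len(set(str(r.question1).lower().split()) &
                  set(str(r.question2).lower().split())), axis=1)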
175. Fuzzy Features
➢ pip install fuzzywuzzy
➢ Uses Levenshtein distance
➢ QRatio
➢ WRatio
➢ Token set ratio
➢ Token sort ratio
➢ Partial token set ratio
➢ Partial token sort ratio
➢ etc. etc. etc.
https://github.com/seatgeek/fuzzywuzzy
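A short sketch of the fuzzywuzzy ratios listed above; the example pair is illustrative:

from fuzzywuzzy import fuzz

q1 = 'How can I be a good geologist?'
q2 = 'What should I do to be a great geologist?'
print(fuzz.QRatio(q1, q2))
print(fuzz.WRatio(q1, q2))
print(fuzz.token_set_ratio(q1, q2))
print(fuzz.token_sort_ratio(q1, q2))
print(fuzz.partial_token_set_ratio(q1, q2))
print(fuzz.partial_token_sort_ratio(q1, q2))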
177. TF-IDF
➢ TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
➢ IDF(t) = log(total number of documents / number of documents containing term t)
➢ TF-IDF(t) = TF(t) * IDF(t)
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=3, max_features=None,
                        strip_accents='unicode', analyzer='word',
                        token_pattern=r'\w{1,}', ngram_range=(1, 2),
                        use_idf=1, smooth_idf=1, sublinear_tf=1,
                        stop_words='english')
179. Fuzzy Features
➢ Also known as approximate string matching
➢ Number of “primitive” operations required to convert one string into an exact match of the other
➢ Primitive operations:
○ Insertion
○ Deletion
○ Substitution
➢ Typically used for:
○ Spell checking
○ Plagiarism detection
○ DNA sequence matching
○ Spam filtering
185. Word2Vec Features
➢ A multi-dimensional vector for every word in the vocabulary
➢ Often yields great insights
➢ Very popular in natural language processing tasks
➢ Google News vectors, 300d
186. Word2Vec Features
➢ Representing words
➢ Representing sentences
# assumes: model = pre-trained word2vec vectors (see the gensim sketch earlier),
# plus NLTK's tokenizer and English stopword list
import numpy as np
from nltk import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def sent2vec(s):
    # lowercase, tokenize, drop stopwords and non-alphabetic tokens
    words = word_tokenize(str(s).lower())
    words = [w for w in words if w not in stop_words]
    words = [w for w in words if w.isalpha()]
    M = []
    for w in words:
        if w in model:  # skip out-of-vocabulary words
            M.append(model[w])
    M = np.array(M)
    v = M.sum(axis=0)
    return v / np.sqrt((v ** 2).sum())
187. W2V Features: WMD
Kusner, M., Sun, Y., Kolkin, N. & Weinberger, K. (2015). From Word Embeddings To Document Distances.
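A brief sketch: gensim exposes Word Mover’s Distance directly on word2vec vectors (requires the pyemd package; the sentence pair is illustrative):

distance = model.wmdistance(
    'how do i learn python'.split(),
    'what is the best way to learn python'.split())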
188. W2V Features: Skew
➢ Skew = 0 for a normal distribution
➢ Skew > 0: more weight in the right tail (the tail extends to the right)
189. W2V Features: Kurtosis
➢ Fourth central moment divided by the square of the variance
➢ Types:
○ Pearson
○ Fisher: subtract 3.0 from the Pearson value so that a normal distribution scores 0
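A short scipy.stats sketch of turning skew and kurtosis into features over the sent2vec vectors from earlier (Fisher kurtosis is scipy’s default):

from scipy.stats import skew, kurtosis

v1 = sent2vec(q1)  # q1: a question string
features = [skew(v1), kurtosis(v1)]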
198. LSTM
➢ Long Short-Term Memory
➢ A type of RNN
➢ Learns long-term dependencies
➢ Two LSTM layers were used here
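A minimal Keras sketch of an embedding followed by two LSTM layers, as described above; this mirrors the 300d embeddings used elsewhere in the deck but is not the exact architecture:

from keras.models import Sequential
from keras.layers import Embedding, LSTM

model2 = Sequential()
model2.add(Embedding(len(word_index) + 1, 300,
                     weights=[embedding_matrix], input_length=40,
                     trainable=False))
model2.add(LSTM(300, return_sequences=True))  # first layer returns the full sequence
model2.add(LSTM(300))                         # second layer returns the final state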
199. 1D CNN
➢ One-dimensional convolutional layer
➢ Temporal convolution
➢ Simple to implement:
# naive sketch; assumes x is zero-padded so x[i - j] is always defined
for i in range(sample_length):
    y[i] = 0
    for j in range(kernel_length):
        y[i] += x[i - j] * h[j]
201. Time Distributed Dense Layer
➢ TimeDistributed wrapper around dense layer
➢ TimeDistributed applies the layer to every temporal slice of input
➢ Followed by Lambda layer
➢ Implements the “translation” layer used by Stephen Merity (keras_snli model)
model1 = Sequential()
model1.add(Embedding(len(word_index) + 1,
                     300,
                     weights=[embedding_matrix],
                     input_length=40,
                     trainable=False))
model1.add(TimeDistributed(Dense(300, activation='relu')))
model1.add(Lambda(lambda x: K.sum(x, axis=1), output_shape=(300,)))
202. GloVe Embeddings
➢ Count-based model
➢ Dimensionality reduction on co-occurrence counts matrix
➢ word-context matrix -> word-feature matrix
➢ Common Crawl
○ 840B tokens, 2.2M vocab, 300d vectors
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation
203. Basis of Deep Learning Model
➢ Keras-snli model: https://github.com/Smerity/keras_snli
204. Before Training DeepNets
➢ Tokenize data
➢ Convert text data to sequences
from keras.preprocessing import text, sequence

# data: DataFrame with question1/question2 columns
tk = text.Tokenizer(nb_words=200000)  # nb_words is num_words in Keras 2
max_len = 40
tk.fit_on_texts(list(data.question1.values) +
                list(data.question2.values.astype(str)))
x1 = tk.texts_to_sequences(data.question1.values)
x1 = sequence.pad_sequences(x1, maxlen=max_len)
x2 = tk.texts_to_sequences(data.question2.values.astype(str))
x2 = sequence.pad_sequences(x2, maxlen=max_len)
word_index = tk.word_index
205. Before Training DeepNets
➢ Initialize GloVe embeddings
import numpy as np
from tqdm import tqdm

# map each GloVe token to its 300d vector
embeddings_index = {}
f = open('data/glove.840B.300d.txt')
for line in tqdm(f):
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
206. Before Training DeepNets
➢ Create the embedding matrix
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
218. Time to Train the DeepNet
➢ Total params: 174,913,917
➢ Trainable params: 60,172,917
➢ Non-trainable params: 114,741,000
➢ NVIDIA Titan X
220. Combined Results
The deep network was trained on an NVIDIA Titan X; each epoch took approximately 300 seconds, and full training took 10-15 hours. The network achieved an accuracy of 0.848 (~0.85).
221. Improving Further
➢ Cleaning the text data, e.g. correcting misspellings
➢ POS tagging
➢ Entity recognition
➢ Combining deepnet with traditional ML models
222. Conclusion & References
➢ The deepnet gives near state-of-the-art results
➢ BiMPM model accuracy: 88%
Some references:
➢ Zhiguo Wang, Wael Hamza and Radu Florian. “Bilateral Multi-Perspective Matching for Natural Language Sentences” (BiMPM)
➢ Matthew Honnibal. “Deep text-pair classification with Quora’s 2017 question dataset,” 13 February 2017. Retrieved from https://explosion.ai/blog/quora-deep-text-pair-classification
➢ Bradley Pallen’s work: https://github.com/bradleypallen/keras-quora-question-pairs
224. [Architecture diagram: input arrives via chat, avatar, text or (speech); natural language processing with pre-trained domain knowledge performs classification of intent and identifies entities (extracting information); requests are routed through an API with analytics, delegation to customer support and delegation to back-end robots; monitoring and AI training run alongside for instant processing and end-to-end automation]
225. Conversation without API
[Diagram: the enquiry is pre-processed (stemming, cross-language handling, misspellings algorithm), classified by intent (1. Insurance → 2. Vehicle → 3. Car → 4. Rules for practice driving), and answered with a pre-defined reply]
User: “Hey you, do you knoww if my car insruacne covers practice driving??” (the misspellings are handled by the pre-processing step)
Bot: “You don’t need to adjust your car insurance when practice driving with a learner’s permit. In case of damage, it’s the supervisor with a full driver’s license that shall write and sign the insurance claim.”
226. Conversation with API
User: “Hi James, what’s the weather in Berlin on Thursday?”
Bot: “Thursday’s forecast for Berlin is partly sunny and mostly cloudy.”
[Diagram: redirect to API (weather); required value: location; optional value: date]
228. Thank you!
Questions / Comments?
All The Code:
❖ github.com/abhishekkrthakur
Get in touch:
➢ E-mail: abhishek4@gmail.com
➢ LinkedIn: bit.ly/thakurabhishek
➢ Kaggle: kaggle.com/abhishek
➢ Twitter: @abhi1thakur
If everything fails, use XGBoost