Speaker: Andriy Gryshchuk, Senior Research Engineer at Grammarly.
Summary: Paraphrase detection is a challenging NLP task since it requires both thorough syntactic and thorough semantic analysis to identify whether two phrases have the same intent. A few months ago, paraphrase identification became an objective of one of the most popular Kaggle competitions, Quora Question Pairs. In this talk, Yuriy Guts and Andriy Gryshchuk, silver medalists of the competition, will share their arsenal of statistical, linguistic, and Deep Learning approaches that helped them succeed in this challenge.
2. Popularity
3,300 teams (>4,000 participants)
NLP
Feature engineering
Deep Learning
Interesting and big enough dataset
Different from other recent competitions
4. Goal - Find duplicate questions
Classification formulation:
For each pair of questions, predict the probability that the questions have the same meaning
5. Data
Train set: 400,000 pairs of questions
(very large compared with the previously available paraphrase detection datasets)
(question1, question2, is_duplicate)
Test set: 2,345,796 pairs
(some of them artificially generated as an anti-cheating measure)
Manually labeled (noisy)
6. Examples - positive
'Why have human beings evolved more than any other beings on Earth?'
'What technicality results in humans being more intelligent than other animals?'
'How Do You Protect Yourself from Dogs?'
'What is the best way to save yourself from an attacking angry dog?'
'Why are Quorians more liberal than conservative?'
'Why does Quora tend to attract more leftists than conservatives?'
7. Examples - Negatives
How to convert fractions to whole numbers?
How do you convert whole numbers into fractions?
What tips do you have for starting a school newspaper?
What are some tips on starting a school newspaper?
What Do I Do About My Boyfriend Ignoring Me?
What should I do when my boyfriend is ignoring me?
How dangerous is Mexico City?
Why is Mexico City dangerous?
What are some words that exist in English but do not exist in Japanese?
What are some words that exist in Japanese but do not exist in English?
8. Negatives are not random
There are positive pairs with no common words
There are negative pairs with all the words common
A lot of ambiguous cases
Noise
9. Metric
Logloss - questionable; ROC AUC could be a much better choice
Very different distributions of the train and test sets
36% positives in the train set
17% positives in the test set (public part)
Upsampling (or a correction formula)
10. Metric
Logloss - questionable; ROC AUC could be a much better choice
Very different distributions of the train and test sets
36% positives in the train set
17% positives in the test set (public part)
Upsampling (or a correction formula; see the sketch below)
When distributions are different, choose a metric that is less sensitive to distribution changes
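One way to reconcile the different priors is the standard probability-correction formula; whether the team used exactly this adjustment (rather than upsampling) is an assumption, but a minimal sketch looks like this:

def adjust_prior(p, train_pos=0.36, test_pos=0.17):
    # Rescale a predicted probability from the train-set class prior (36% positives)
    # to the estimated test-set prior (17% positives).
    a = test_pos / train_pos              # rescaling factor for the positive class
    b = (1 - test_pos) / (1 - train_pos)  # rescaling factor for the negative class
    return a * p / (a * p + b * (1 - p))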
12. Approaches
Classical ML
90% of effort on creating features, 10% on modelling
Deep Learning
5% of effort on creating features, 95% on modelling
13. Approaches
Classical ML
90% of effort on creating features, 10% on modelling
Deep Learning
5% of effort on creating features, 95% on modelling
Kaggle way - Ensemble them all
14. Classical ML
90% efforts creating features
10% efforts modelling
My team had about 300 features
One of the top teams claimed 4,000 features
15. Sentence as Vector
Sentence vector - just the mean of the word vectors
Or a weighted mean - how to find the right weights?
Unsupervised methods
Similarities (a sketch follows below):
Cosine similarity
Cityblock distance
Euclidean distance
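A minimal sketch of these similarity features, assuming pretrained word vectors are available in a dict-like word_vectors (the name and the 300-dimensional size are assumptions):

import numpy as np
from scipy.spatial.distance import cosine, cityblock, euclidean

def sentence_vector(tokens, word_vectors, dim=300):
    # Sentence vector as the plain (unweighted) mean of the word vectors.
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def similarity_features(q1_tokens, q2_tokens, word_vectors):
    v1 = sentence_vector(q1_tokens, word_vectors)
    v2 = sentence_vector(q2_tokens, word_vectors)
    return {
        'cosine': cosine(v1, v2),        # scipy returns cosine *distance* = 1 - similarity
        'cityblock': cityblock(v1, v2),
        'euclidean': euclidean(v1, v2),
    }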
16. Raw Embeddings
Raw embeddings are surprisingly powerful features
Convert each sentence to a vector and just use the vector components as features (sketched below)
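A sketch of what that means in code, reusing the sentence_vector helper above (simply concatenating the two vectors is one possible layout, not necessarily the one used in the competition):

def raw_embedding_features(q1_tokens, q2_tokens, word_vectors):
    v1 = sentence_vector(q1_tokens, word_vectors)
    v2 = sentence_vector(q2_tokens, word_vectors)
    # Every component of the two sentence vectors becomes its own feature column.
    return np.concatenate([v1, v2])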
18. Deep Learning
Modelling: 95% of effort
Features: 5% of effort
Pretrained word embeddings are the features
Pad and cut sentences to the same length
Start modelling (see the preprocessing sketch below)
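A minimal Keras preprocessing sketch for these steps (the sentence length, the embedding dimensionality, and the train / word_vectors names are assumptions):

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 30  # pad and cut every question to the same length

tokenizer = Tokenizer()
tokenizer.fit_on_texts(list(train.question1) + list(train.question2))
q1 = pad_sequences(tokenizer.texts_to_sequences(train.question1), maxlen=MAX_LEN)
q2 = pad_sequences(tokenizer.texts_to_sequences(train.question2), maxlen=MAX_LEN)

# Embedding matrix filled from the pretrained word vectors.
emb_mx = np.zeros((len(tokenizer.word_index) + 1, 300))
for word, i in tokenizer.word_index.items():
    if word in word_vectors:
        emb_mx[i] = word_vectors[word]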
19. Ideas for NNs
Sentence embeddings computed just as the mean of the word vectors are powerful
20. Ideas for NNs
Sentence embeddings computed just as the mean of the word vectors are powerful
Weighted mean?
Non-linearity?
21. Ideas for NNs
Sentence embeddings computed just as the mean of the word vectors are powerful
Weighted mean?
Non-linearity?
This is a NN
22. Ideas for NNs
Sentence embeddings computed just as the mean of the word vectors are powerful
Weighted mean?
Non-linearity?
This is a NN
Still just a bag of words
23. Ideas for NNs
Sentence embeddings computed just as the mean of the word vectors are powerful
Weighted mean?
Non-linearity?
This is a NN
Still just a bag of words
N-grams?
24. Ideas for NNs
Sentence embeddings computed just as the mean of the word vectors are powerful
Weighted mean?
Non-linearity?
This is a NN
Still just a bag of words
N-grams?
This is a convolutional NN (sketched below)
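A Keras sketch of that progression, reusing MAX_LEN and emb_mx from the preprocessing sketch above (layer sizes are assumptions): a plain mean of the word vectors is GlobalAveragePooling1D over the embeddings, a trainable weighted combination with a non-linearity is a Dense layer, and looking at n-grams instead of single words is a Conv1D.

from keras.layers import Input, Embedding, Dense, Conv1D, GlobalAveragePooling1D

inp = Input(shape=(MAX_LEN,))
emb = Embedding(emb_mx.shape[0], emb_mx.shape[1], weights=[emb_mx])(inp)

# Bag of words: plain mean of the word vectors (padding included, for simplicity).
mean_vec = GlobalAveragePooling1D()(emb)

# Weighted combination + non-linearity: still a bag of words, but now trainable.
dense_vec = Dense(256, activation='relu')(mean_vec)

# N-grams: a 1D convolution over windows of 3 consecutive words.
ngram_vec = GlobalAveragePooling1D()(Conv1D(256, 3, activation='relu')(emb))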
27. Paraphrase detection state of the art
Microsoft Research Paraphrase Corpus (~5,000 sentence pairs)
Results Table
Methods:
Unsupervised - phrase vector as weighted average
Autoencoder - better phrase vector
Supervised - CNN + structured features
29. Previous works
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011). Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection.
Milajevs, D., Kartsaklis, D., Sadrzadeh, M., and Purver, M. Evaluating Neural Word Representations in Tensor-Based Compositional Settings.
He, H., Gimpel, K., and Lin, J. (2015). Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks.
30. From He, H., Gimpel, K., and Lin, J. (2015). Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks
36. …
# conv_deep_lst and apply_layers are helper functions defined in the elided part of the code.
# Build convolutional "towers" with kernel sizes 3 and 4 over the word embeddings.
deep_lst = [conv_deep_lst(Conv1D, size, emb_mx.shape[1],
                          kernel_regularizer=kernel_regularizer,
                          activity_regularizer=activity_regularizer)
            for size in [3, 4]]
a_deep = [apply_layers(f, a) for f in deep_lst]  # apply each tower to question 1...
b_deep = [apply_layers(f, b) for f in deep_lst]  # ...and, with the same shared layers, to question 2
# Normalized dot product = cosine similarity between the two question representations, per tower.
dot_deep = [keras.layers.dot([q1, q2], normalize=True, axes=-1)
            for q1, q2 in zip(a_deep, b_deep)]
…
43. RNNs vs CNNs
Similar accuracy
CNNs are two orders of magnitude faster
Fast CNNs make it cheap to average many runs (see the snippet below)
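Averaging many runs can be as simple as the following (trained_models and the test inputs are hypothetical names):

import numpy as np
# Average the predicted probabilities of several runs trained with different random seeds.
preds = np.mean([m.predict([q1_test, q2_test]) for m in trained_models], axis=0)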
44. More Features to the NN
Features created for the classical classifiers were added to the NN
The end-to-end promise is great, but if you already have features, use them
45. Final model
Diagram: Question 1 and Question 2 go through the same neural network (all weights shared); its outputs are combined with the “classical” features in a fully connected layer that produces the output (sketched below)
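A rough Keras sketch of this architecture (layer sizes and the feature count are assumptions; build_question_encoder is a hypothetical stand-in for the shared network on the slide):

from keras.layers import Input, Dense, concatenate
from keras.models import Model

q1_in = Input(shape=(MAX_LEN,))
q2_in = Input(shape=(MAX_LEN,))
feats_in = Input(shape=(n_classical_features,))  # hand-crafted “classical” features

encoder = build_question_encoder()  # hypothetical helper; all weights shared between the two questions
merged = concatenate([encoder(q1_in), encoder(q2_in), feats_in])
hidden = Dense(256, activation='relu')(merged)   # fully connected layer
output = Dense(1, activation='sigmoid')(hidden)  # probability of being duplicates

model = Model(inputs=[q1_in, q2_in, feats_in], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy')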
46. Other NNs
RNNs - several orders of magnitude slower
Character-level RNNs - very slow
RNNs with attention
NNs on the same features as the tree-based classifiers
The top team reports that NNs on word vectors + classical features work the best
XGBoost and the like exploited the leak well
48. How to improve?
Deeper networks - would require dedicated embeddings
Positional embeddings
Transfer learning - apply a pre-trained Neural Translation model and take the hidden state of the decoder as input
49. Ensemble
5 folds on the first level
The first level itself was an average of several runs
XGBoost on the second level (see the stacking sketch below)
CV was unstable
“Upsample-bagging” on the second level
Real bagging on the second level (800 rounds)
“Third level” - team ensemble (just a weighted average)
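A minimal sketch of the first-level / second-level setup as plain out-of-fold stacking (names and model interfaces are assumptions; the “upsample-bagging” part is omitted):

import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

def out_of_fold_predictions(make_model, X, y, n_splits=5):
    # First level: 5 folds, each model predicts only on the fold it did not see.
    oof = np.zeros(len(y))
    for train_idx, valid_idx in StratifiedKFold(n_splits=n_splits).split(X, y):
        model = make_model().fit(X[train_idx], y[train_idx])
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    return oof

# Second level: XGBoost trained on the stacked out-of-fold predictions of all first-level models.
X_level2 = np.column_stack([out_of_fold_predictions(f, X, y) for f in first_level_model_builders])
level2 = xgb.XGBClassifier(n_estimators=800).fit(X_level2, y)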
51. Final Ensemble
20 rounds of “upsample-bagging” of XGBoost over 44 first-level models
The team ensemble: 0.8 * andriy’s model + 0.2 * komaki’s
52. Unfortunate Event
Leak
50% of Kaggle competitions have leaks; 20% have “killer” leaks
What about real life?
Be ready
53. Top team exploited the leak a lot
Difficult to compare genuine results
The leak could poison genuine features as well
Trainable embeddings might get info from the leak
The sampling process is a common cause of Kaggle leaks; I would suppose in real life this is true as well. Be careful.
54. Hyperparameters tuning
Ensembles give more than extensive tuning
Just a simple average of two reasonable but different models is better than one overtuned model
K-fold ensembles of different models beat everything
K-fold ensemble even for a single model with one set of hyperparameters
Overtuned models are fragile
If you love tuning - regularize