Speaker: Andriy Gryshchuk, Senior Research Engineer at Grammarly.
Summary: Paraphrase detection is a challenging NLP task since it requires both thorough syntactic and thorough semantic analysis to identify whether two phrases have the same intent. A few months ago, paraphrase identification became an objective of one of the most popular Kaggle competitions, Quora Question Pairs. In this talk, Yuriy Guts and Andriy Gryshchuk, silver medalists of the competition, will share their arsenal of statistical, linguistic, and Deep Learning approaches that helped them succeed in this challenge.
2. Popularity
3,300 teams (>4,000 participants)
NLP
Feature engineering
Deep Learning
Interesting and big enough dataset
Different from other recent competitions
4. Goal - Find duplicate questions
Classification formulation:
For each pair of questions, predict the probability that the questions have the same meaning
5. Data
Train set: 400,000 pairs of questions
(very large compared with the previously available paraphrase detection datasets)
(question1, question2, is_duplicate)
Test set: 2,345,796 pairs
(some of them artificially generated as an anti-cheating measure)
Manually labeled (noisy)
6. Examples - positive
'Why have human beings evolved more than any other beings on Earth?'
'What technicality results in humans being more intelligent than other animals?'
'How Do You Protect Yourself from Dogs?'
'What is the best way to save yourself from an attacking angry dog?'
'Why are Quorians more liberal than conservative?'
'Why does Quora tend to attract more leftists than conservatives?'
7. Examples - Negatives
How to convert fractions to whole numbers?
How do you convert whole numbers into fractions?
What tips do you have for starting a school newspaper?
What are some tips on starting a school newspaper?
What Do I Do About My Boyfriend Ignoring Me?
What should I do when my boyfriend is ignoring me?
How dangerous is Mexico City?
Why is Mexico City dangerous?
What are some words that exist in English but do not exist in Japanese?
What are some words that exist in Japanese but do not exist in English?
8. Negatives are not random
There are positive pairs with no common words
There are negative pairs with all the words common
A lot of ambiguous cases
Noise
9. Metric
Logloss - questionable; ROC AUC could be a much better choice
Very different distributions of the train and test sets
36% positives in the train set
17% positives in the test set (public part)
Upsampling (or a correction formula)
10. Metric
Logloss - questionable; ROC AUC could be a much better choice
Very different distributions of the train and test sets
36% positives in the train set
17% positives in the test set (public part)
Upsampling (or a correction formula; see the sketch below)
When distributions are different, choose a metric that is less sensitive to distribution changes
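One way to reconcile the different priors is the standard probability-correction formula; whether the team used exactly this adjustment (rather than upsampling) is an assumption, but a minimal sketch looks like this:

def adjust_prior(p, train_pos=0.36, test_pos=0.17):
    # Rescale a predicted probability from the train-set class prior (36% positives)
    # to the estimated test-set prior (17% positives).
    a = test_pos / train_pos              # rescaling factor for the positive class
    b = (1 - test_pos) / (1 - train_pos)  # rescaling factor for the negative class
    return a * p / (a * p + b * (1 - p))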
12. Approaches
Classical ML
90% of effort on creating features, 10% on modelling
Deep Learning
5% of effort on creating features, 95% on modelling
13. Approaches
Classical ML
90% of effort on creating features, 10% on modelling
Deep Learning
5% of effort on creating features, 95% on modelling
Kaggle way - Ensemble them all
14. Classical ML
90% efforts creating features
10% efforts modelling
My team had about 300 features
One of the top teams claimed 4,000 features
15. Sentence as Vector
Sentence vector - just the mean of the word vectors
Or a weighted mean - how to find the right weights?
Unsupervised methods
Similarities (a sketch follows below):
Cosine similarity
Cityblock distance
Euclidean distance
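A minimal sketch of these similarity features, assuming pretrained word vectors are available in a dict-like word_vectors (the name and the 300-dimensional size are assumptions):

import numpy as np
from scipy.spatial.distance import cosine, cityblock, euclidean

def sentence_vector(tokens, word_vectors, dim=300):
    # Sentence vector as the plain (unweighted) mean of the word vectors.
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def similarity_features(q1_tokens, q2_tokens, word_vectors):
    v1 = sentence_vector(q1_tokens, word_vectors)
    v2 = sentence_vector(q2_tokens, word_vectors)
    return {
        'cosine': cosine(v1, v2),        # scipy returns cosine *distance* = 1 - similarity
        'cityblock': cityblock(v1, v2),
        'euclidean': euclidean(v1, v2),
    }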
16. Raw Embeddings
Raw embeddings are surprisingly powerful features
Convert each sentence to a vector and just use the vector components as features (sketched below)
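A sketch of what that means in code, reusing the sentence_vector helper above (simply concatenating the two vectors is one possible layout, not necessarily the one used in the competition):

def raw_embedding_features(q1_tokens, q2_tokens, word_vectors):
    v1 = sentence_vector(q1_tokens, word_vectors)
    v2 = sentence_vector(q2_tokens, word_vectors)
    # Every component of the two sentence vectors becomes its own feature column.
    return np.concatenate([v1, v2])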
18. Deep Learning
Modelling: 95% of effort
Features: 5% of effort
Pretrained word embeddings are the features
Pad and cut sentences to the same length
Start modelling (see the preprocessing sketch below)
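A minimal Keras preprocessing sketch for these steps (the sentence length, the embedding dimensionality, and the train / word_vectors names are assumptions):

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 30  # pad and cut every question to the same length

tokenizer = Tokenizer()
tokenizer.fit_on_texts(list(train.question1) + list(train.question2))
q1 = pad_sequences(tokenizer.texts_to_sequences(train.question1), maxlen=MAX_LEN)
q2 = pad_sequences(tokenizer.texts_to_sequences(train.question2), maxlen=MAX_LEN)

# Embedding matrix filled from the pretrained word vectors.
emb_mx = np.zeros((len(tokenizer.word_index) + 1, 300))
for word, i in tokenizer.word_index.items():
    if word in word_vectors:
        emb_mx[i] = word_vectors[word]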
19. Ideas for NNs
Sentence embeddings computed just as the mean of the word vectors are powerful
20. Ideas for NNs
Sentence embeddings computed just as the mean of the word vectors are powerful
Weighted mean?
Non-linearity?
21. Ideas for NNs
Sentence embeddings computed just as the mean of the word vectors are powerful
Weighted mean?
Non-linearity?
This is a NN
22. Ideas for NNs
Sentence embeddings computed just as the mean of the word vectors are powerful
Weighted mean?
Non-linearity?
This is a NN
Still just a bag of words
23. Ideas for NNs
Sentence embeddings computed just as the mean of the word vectors are powerful
Weighted mean?
Non-linearity?
This is a NN
Still just a bag of words
N-grams?
24. Ideas for NNs
Sentence embeddings computed just as the mean of the word vectors are powerful
Weighted mean?
Non-linearity?
This is a NN
Still just a bag of words
N-grams?
This is a convolutional NN (sketched below)
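A Keras sketch of that progression, reusing MAX_LEN and emb_mx from the preprocessing sketch above (layer sizes are assumptions): a plain mean of the word vectors is GlobalAveragePooling1D over the embeddings, a trainable weighted combination with a non-linearity is a Dense layer, and looking at n-grams instead of single words is a Conv1D.

from keras.layers import Input, Embedding, Dense, Conv1D, GlobalAveragePooling1D

inp = Input(shape=(MAX_LEN,))
emb = Embedding(emb_mx.shape[0], emb_mx.shape[1], weights=[emb_mx])(inp)

# Bag of words: plain mean of the word vectors (padding included, for simplicity).
mean_vec = GlobalAveragePooling1D()(emb)

# Weighted combination + non-linearity: still a bag of words, but now trainable.
dense_vec = Dense(256, activation='relu')(mean_vec)

# N-grams: a 1D convolution over windows of 3 consecutive words.
ngram_vec = GlobalAveragePooling1D()(Conv1D(256, 3, activation='relu')(emb))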
27. Paraphrase detection state of the art
Microsoft Research Paraphrase Corpus (~5,000 sentence pairs)
Results Table
Methods:
Unsupervised - phrase vector as weighted average
Autoencoder - better phrase vector
Supervised - CNN + structured features
29. Previous works
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y., and Manning, C. D. (2011). Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection.
Milajevs, D., Kartsaklis, D., Sadrzadeh, M., and Purver, M. Evaluating Neural Word Representations in Tensor-Based Compositional Settings.
He, H., Gimpel, K., and Lin, J. (2015). Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks.
30. From He, H., Gimpel, K., and Lin, J. (2015). Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks
36. …
# conv_deep_lst and apply_layers are helper functions defined in the elided part of the code.
# Build convolutional "towers" with kernel sizes 3 and 4 over the word embeddings.
deep_lst = [conv_deep_lst(Conv1D, size, emb_mx.shape[1],
                          kernel_regularizer=kernel_regularizer,
                          activity_regularizer=activity_regularizer)
            for size in [3, 4]]
a_deep = [apply_layers(f, a) for f in deep_lst]  # apply each tower to question 1...
b_deep = [apply_layers(f, b) for f in deep_lst]  # ...and, with the same shared layers, to question 2
# Normalized dot product = cosine similarity between the two question representations, per tower.
dot_deep = [keras.layers.dot([q1, q2], normalize=True, axes=-1)
            for q1, q2 in zip(a_deep, b_deep)]
…
43. RNNs vs CNNs
Similar accuracy
CNNs are two orders of magnitude faster
Fast CNNs make it cheap to average many runs (see the snippet below)
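Averaging many runs can be as simple as the following (trained_models and the test inputs are hypothetical names):

import numpy as np
# Average the predicted probabilities of several runs trained with different random seeds.
preds = np.mean([m.predict([q1_test, q2_test]) for m in trained_models], axis=0)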
44. More Features to the NN
Features created for the classical classifiers were added to the NN
The end-to-end promise is great, but if you already have features, use them
45. Final model
Diagram: Question 1 and Question 2 go through the same neural network (all weights shared); its outputs are combined with the “classical” features in a fully connected layer that produces the output (sketched below)
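A rough Keras sketch of this architecture (layer sizes and the feature count are assumptions; build_question_encoder is a hypothetical stand-in for the shared network on the slide):

from keras.layers import Input, Dense, concatenate
from keras.models import Model

q1_in = Input(shape=(MAX_LEN,))
q2_in = Input(shape=(MAX_LEN,))
feats_in = Input(shape=(n_classical_features,))  # hand-crafted “classical” features

encoder = build_question_encoder()  # hypothetical helper; all weights shared between the two questions
merged = concatenate([encoder(q1_in), encoder(q2_in), feats_in])
hidden = Dense(256, activation='relu')(merged)   # fully connected layer
output = Dense(1, activation='sigmoid')(hidden)  # probability of being duplicates

model = Model(inputs=[q1_in, q2_in, feats_in], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy')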
46. Other NNs
RNNs - several orders of magnitude slower
Character-level RNNs - very slow
RNNs with attention
NNs on the same features as the tree-based classifiers
The top team reports that NNs on word vectors + classical features work the best
XGBoost and the like exploited the leak well
48. How to improve?
Deeper networks - would require dedicated embeddings
Positional embeddings
Transfer learning - apply a pre-trained Neural Translation model and take the hidden state of the decoder as input
49. Ensemble
5 folds on the first level
The first level itself was an average of several runs
XGBoost on the second level (see the stacking sketch below)
CV was unstable
“Upsample-bagging” on the second level
Real bagging on the second level (800 rounds)
“Third level” - team ensemble (just a weighted average)
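A minimal sketch of the first-level / second-level setup as plain out-of-fold stacking (names and model interfaces are assumptions; the “upsample-bagging” part is omitted):

import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

def out_of_fold_predictions(make_model, X, y, n_splits=5):
    # First level: 5 folds, each model predicts only on the fold it did not see.
    oof = np.zeros(len(y))
    for train_idx, valid_idx in StratifiedKFold(n_splits=n_splits).split(X, y):
        model = make_model().fit(X[train_idx], y[train_idx])
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    return oof

# Second level: XGBoost trained on the stacked out-of-fold predictions of all first-level models.
X_level2 = np.column_stack([out_of_fold_predictions(f, X, y) for f in first_level_model_builders])
level2 = xgb.XGBClassifier(n_estimators=800).fit(X_level2, y)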
51. Final Ensemble
20 rounds of “upsample-bagging” of XGBoost over 44 first-level models
The team ensemble: 0.8 * andriy’s model + 0.2 * komaki’s
52. Unfortunate Event
Leak
50% of Kaggle competitions have leaks; 20% have “killer” leaks
What about real life?
Be ready
53. Top team exploited the leak a lot
Difficult to compare genuine results
The leak could poison genuine features as well
Trainable embeddings might get info from the leak
The sampling process is a common cause of Kaggle leaks; I would suppose in real life this is true as well. Be careful.
54. Hyperparameters tuning
Ensembles give more than extensive tuning
Just a simple average of two reasonable but different models is better than one overtuned model
K-fold ensembles of different models beat everything
K-fold ensemble even for a single model with one set of hyperparameters
Overtuned models are fragile
If you love tuning - regularize