1. Human Interface Laboratory
Pay Attention to Categories:
Syntax-Based Sentence Modeling with
Metadata Projection Matrix
2020. 10. 24, @PACLIC 34
Won Ik Cho, Nam Soo Kim (SNU ECE & INMC)
5. Introduction
• Brief overview on sentence modeling
Self-attentive models (which advance self-attention)
• Still a useful approach to sentence classification
6. Introduction
• Motivation
What if we need to pay more attention to some syntactic categories and
want to decide the intensity automatically?
• e.g., Oxymoron detection (Cho et al., 2017)
– Drink a sugar-free sweet tea
– When I’m low, I get high
The attention mechanism is useful, but offers little control over specific
syntactic categories
• Rarely a main consideration, since the necessity is underestimated
• Minimal information (e.g., POS) can be of help for some tasks!
• How can we take into account that information beyond just attaching it to each
token?
7. Related Work
• Word embedding
Dense embedding of words, in view of distributional semantics
Projecting words into a low-dimensional space, with the objective of learning
a distribution that enables prediction of the probable surrounding words
Representative models
• Word2Vec (Mikolov et al., 2013)
• GloVe (Pennington et al., 2014)
• fastText (Bojanowski et al., 2016)
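The objective above ("predicting the probable surrounding words") can be illustrated with a toy full-softmax skip-gram in numpy. This is a minimal sketch for intuition only, not any of the cited toolkits; the corpus, dimensions, and learning rate are illustrative choices.

```python
import numpy as np

# Toy skip-gram sketch: learn embeddings so that a center word
# predicts its surrounding words within a small window.
rng = np.random.default_rng(0)
corpus = "the cat sat on the mat".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 8

W_in = rng.normal(scale=0.1, size=(V, D))   # center-word embeddings
W_out = rng.normal(scale=0.1, size=(V, D))  # context-word embeddings

def pairs(sent, window=2):
    """Yield (center, context) index pairs within the window."""
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                yield idx[w], idx[sent[j]]

def step(lr=0.1):
    """One full-softmax SGD pass over the corpus; returns the total loss."""
    global W_in, W_out
    loss = 0.0
    for c, o in pairs(corpus):
        scores = W_out @ W_in[c]            # (V,) logits over the vocabulary
        p = np.exp(scores - scores.max())
        p /= p.sum()
        loss -= np.log(p[o])
        grad = p.copy()
        grad[o] -= 1.0                      # d(-log p[o]) / d scores
        g_in = W_out.T @ grad
        W_out -= lr * np.outer(grad, W_in[c])
        W_in[c] -= lr * g_in
    return loss
```

Real implementations replace the full softmax with negative sampling or hierarchical softmax for efficiency; the gradient structure stays the same.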
8. Related Work
• Deep learning techniques
Convolutional neural networks
• Primarily applied to the vision area
• Used in Kim (2014) as a 1D convolution that represents the sentence as a
sequence of dense word vectors
Recurrent neural networks
• Adopted to model sequential data
• Long short-term memory is used to cope with the vanishing gradient
• Bidirectional LSTM to consider directionality
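The Kim (2014)-style 1D convolution can be sketched in numpy: the sentence is a stack of word vectors, each filter spans a few consecutive words, and max-over-time pooling keeps one feature per filter. All shapes and values below are illustrative, not the paper's settings.

```python
import numpy as np

# Sketch of a 1D convolution over a sentence matrix.
rng = np.random.default_rng(0)
L, D, K, F = 7, 10, 3, 4                 # length, word dim, window, filters
sent = rng.normal(size=(L, D))           # pretend-embedded sentence
filters = rng.normal(size=(F, K, D))     # each filter covers K word vectors
bias = np.zeros(F)

# Slide each filter over all K-word windows -> (F, L-K+1) feature maps.
conv = np.array([[np.sum(sent[i:i + K] * filters[f]) + bias[f]
                  for i in range(L - K + 1)]
                 for f in range(F)])

# ReLU, then max-over-time pooling: one feature per filter.
feat = np.maximum(conv, 0).max(axis=1)   # shape (F,)
```

The pooled vector `feat` is what a classifier layer would consume; multiple window sizes are typically concatenated.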
9. Related Work
• Deep learning techniques
Attention models
• First used to handle word-order sensitivity and matching in machine
translation
• Evolved into various formats, such as location-based attention
Self-attentive sentence embedding (Lin et al., 2017)
• Related to self-attention, but more applicable to the BiLSTM format
10. Related Work
• Deep learning techniques
Self-attentive sentence embedding
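Lin et al. (2017)'s structured self-attention takes only a few lines: the annotation matrix A is computed from the BiLSTM hidden states H as A = softmax(W_s2 tanh(W_s1 Hᵀ)), and the sentence embedding is M = A H. The dimensions below are illustrative, not the paper's hyperparameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Numpy sketch of the self-attentive sentence embedding.
rng = np.random.default_rng(0)
n, two_u, d_a, r = 6, 8, 5, 3        # tokens, 2*hidden, attention dim, hops
H = rng.normal(size=(n, two_u))      # stand-in for BiLSTM hidden states
W_s1 = rng.normal(size=(d_a, two_u))
W_s2 = rng.normal(size=(r, d_a))

A = softmax(W_s2 @ np.tanh(W_s1 @ H.T), axis=1)  # (r, n), each row sums to 1
M = A @ H                                        # (r, 2u) sentence embedding
```

Each of the r rows of A is one attention "hop" over the tokens; the paper adds a penalty term encouraging the hops to differ.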
11. Proposed Method
• Overall description
Sequential word embedding and BiLSTM
Feature extraction for the attention source
• TF-IDF? BiLSTM?
Attention source activated with ReLU
PAC structure with:
• Weight layer with category-wise info
• Projection matrix
• Multiplication (𝛼_1 … 𝛼_𝐿)
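One plausible reading of the PAC structure above, sketched in numpy: a one-hot projection matrix maps each token position to its syntactic (POS) category, a learned category-wise weight modulates the ReLU-activated attention source, and the result is normalized into token attention weights. The variable names, shapes, and the exact way the projection enters the computation are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

# Hedged sketch of category-wise attention modulation.
rng = np.random.default_rng(0)
L, n_p = 5, 3                         # sentence length, number of POS tags
pos = np.array([0, 1, 1, 2, 0])       # POS tag index per token (given prior)
P = np.eye(n_p)[pos]                  # (L, n_p) one-hot projection matrix
w = rng.normal(size=n_p)              # learned category-wise weights

# ReLU-activated attention source (e.g., from TF-IDF or a BiLSTM).
source = np.maximum(rng.normal(size=L), 0)

logits = source * (P @ w)             # category-modulated attention logits
alpha = np.exp(logits) / np.exp(logits).sum()   # weights over tokens
```

The key point is that `w` is learned, so the model decides automatically how much attention each syntactic category deserves, while the projection `P` is fixed by the tagger output.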
15. Experiment and Result
• Implementation
Baseline features
• TF-IDFs and bigrams
– Dictionary size set to 3,000 (=30 * 100)
• GloVe pretrained with Twitter 27B
– Word vector dim. 100
– Padding max length 30
Baseline classifiers
• SVM for TF-IDF
• NN for GloVe (averaged) with Adam(0.0005) and batch size 16
• CNN (32 filters, window 3) and BiLSTM (hidden dim. 64) for GloVe (padded)
Baseline attention model
• Lin et al. (2017) with context vector dim. 64, along with the above BiLSTM
The proposed
𝑛_𝑝 follows the NLTK POS tagging result
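The TF-IDF-with-bigrams baseline features can be approximated with a hand-rolled featurizer. This is a sketch of the general recipe, not the exact pipeline used in the experiments; only the 3,000-term dictionary cap is taken from the slide.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """Yield all unigrams and bigrams of a token list."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def tfidf_features(docs, max_terms=3000):
    """TF-IDF vectors over a dictionary capped at max_terms terms."""
    toks = [list(ngrams(d.lower().split())) for d in docs]
    df = Counter(t for ts in toks for t in set(ts))   # document frequency
    vocab = [t for t, _ in df.most_common(max_terms)]
    index = {t: i for i, t in enumerate(vocab)}
    N = len(docs)
    feats = []
    for ts in toks:
        tf = Counter(ts)
        vec = [0.0] * len(vocab)
        for t, c in tf.items():
            if t in index:
                vec[index[t]] = (c / len(ts)) * math.log(N / df[t])
        feats.append(vec)
    return feats, vocab
```

These vectors are what the SVM baseline would be trained on; terms occurring in every document get zero IDF weight and contribute nothing.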
16. Experiment and Result
• Dataset
Metalanguage detection (2,393)
• Investigates whether a sentence contains explicit mention terms (‘title’ or ‘name’)
• Contains 629 mentioned and 1,764 not-mentioned instances excerpted from
Wikipedia
Irony detection (4,618)
• Distributed in SemEval 2018 Task 3 for ironic tweet detection (Van Hee et al., 2018).
• The binary-label case was considered; 2,222 tweets contain irony and 2,396 do not
Subjectivity detection (10,000)
• Refers to Pang and Lee (2004); checks if the movie review contains a subjective
judgment
• Incorporates equally 5,000 instances for each of the subjective and objective reviews
Stance classification (3,835)
• Part of distributed dataset from SemEval 2016 Task 6 (Mohammad et al., 2016)
• Labels correspond to target, stance, opinion-towards, and sentiment information
• 1,205, 2,409, and 221 instances for favor, against, and none, respectively
Sentiment classification (20,632)
• Utilizes the test data released in SemEval 2017 Task 4 (Rosenthal et al., 2017)
• Consists of 7,059 positive, 3,231 negative and 10,342 neutral tweets
17. Experiment and Result
• Result
The proposed model surpasses the baseline results on META, IRONY, and SUBJ
• These tasks are expected to benefit from identifying specific syntactic categories
• Relatively weak at discerning latent information such as STANCE and SENT
Dependency on the source information
• META highly prefers the word-level attention source
– Concerns explicit existence of certain lexical terms
18. Experiment and Result
• Result
STANCE and IRONY both require contextual information, but the model works
better on IRONY?
• IRONY incorporates hashtagged information which matters in the prediction
Expected suitable application
• Bitstream or symbolic music analysis (where the formatted/syntactic information
plays an important role)
19. Visualization
• Two excerpts from SUBJ
Enables seeing which syntactic category is the most important
Will be effective if the constituency tagging is more reliable!
20. Conclusion
• Sentence modeling builds on recent language modeling and deep
learning techniques
• In attention-based approaches, how the model pays attention is
determined automatically, but they hardly consider the syntactic
categories that are given as prior information
• Incorporating such information into the attention weights via a
projection matrix brings an advantage in tasks that necessitate
attention to lexical features