Download software: http://gate.ac.uk/wiki/twitter-postagger.html
Original paper: http://derczynski.com/sheffield/papers/twitter_pos.pdf
Part-of-speech information is a prerequisite for many NLP algorithms. However, Twitter text is difficult to part-of-speech tag: it is noisy, with linguistic errors and idiosyncratic style. We present a detailed error analysis of existing taggers, motivating a series of tagger augmentations which are demonstrated to improve performance. We identify and evaluate techniques for improving English part-of-speech tagging performance in this genre.
Further, we present a novel approach to system combination for the case where available taggers use different tagsets, based on vote-constrained bootstrapping with unlabeled data. Coupled with assigning prior probabilities to some tokens and handling of unknown words and slang, we reach 88.7% tagging accuracy (90.5% on development data). This is a new high in PTB-compatible tweet part-of-speech tagging, reducing token error by 26.8% and sentence error by 12.2%. The model, training data and tools are made available.
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
1. Twitter Part-of-Speech Tagging for All:
Overcoming Sparse and Noisy Data
Leon Derczynski
Alan Ritter
Sam Clark
Kalina Bontcheva
2. Streaming social media is powerful
● It's Big Data!
– Velocity: 500M tweets / day
– Volume: 20M users / month
– Variety: earthquakes, stocks, this guy
● Sample of all human discourse - unprecedented
● Not only where people are & when, but also
what they are doing
● Interesting stuff - just ask the NSA!
3. Tweets are dirty
● You all know what Twitter is, so let's just look at
some difficult tweets
● Orthography: Kk its 22:48 friday nyt :D really
tired so imma go to sleep :) good nyt x god
bles xxxxx
● Fragments: Bonfire tonite. All are welcome,
joe included
● Capitalisation: Don't Have Time To Stop In???
Then, Check Out Our Quick Full Service Drive
Thru Window :)
● Nonverbal acts: RT @Huddy85: @Mz_Twilightxxx
*kisses your ass**sneezes after* Lol
4. Tough tweets: Do we even care?
● Most tweets are linguistically fairly well-formed
● RT @DesignerDepot: Minimalist Web Design: When
Less is More - http://ow.ly/2FwyX
● just went on an unfollowing spree... there's no
point of following you if you haven't tweeted
in 10+ days. #justsaying ..
● The tweets we find most difficult are those that
seem to say the least
● So im in tha chi whts popping tonight?
● i just gave my momma some money 4 a bill.... she
smiled when i put it n her hand __AND__ said "i
wanna go out to eat"... -______- HELLA SCAN
5. We do care
● However, there is utility in trivia:
– Sadilek: predict whether you will get flu, using spatial co-location and friend networks
– Sugumaran (U. Northern Iowa): crow corpse reports precede West Nile Virus outbreaks
– Emerging events: tendency to describe briefly
“There's a dead crow in my garden”
@mari: i think im sick ugh..
6. Problem representation
● Split tweets into discrete tokens (PTB conventions, plus URLs and smileys)
● Put tokens in categories, depending on linguistic function
● Discriminative
– cases one by one
– e.g. unigram tagger
● Sequence labelling
– order matters!
– consider neighbouring labels
● Goal: label the whole sequence correctly
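To make the contrast concrete, here is a minimal unigram tagger sketch (illustrative only, not the paper's implementation): each token gets its most frequent training tag, with no look at neighbours.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """The whole 'model' is a word -> most-frequent-tag lookup table."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_tokens(model, tokens, default="NN"):
    """Label tokens one by one; neighbouring labels are never consulted."""
    return [(tok, model.get(tok, default)) for tok in tokens]

model = train_unigram_tagger([[("good", "JJ"), ("nyt", "NN")]])
print(tag_tokens(model, ["good", "nyt", "x"]))
# [('good', 'JJ'), ('nyt', 'NN'), ('x', 'NN')]
```

A sequence labeller instead scores whole tag sequences, so the label chosen at one position can change its neighbours' labels.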
7. Word order still matters.. just
● Hard for tweets: exclamations and fragments
● Complete, well-formed sequences are a bit rare
● @FeeninforPretty making something to eat,
aint ate all day
● Peace green tea time!! Happyzone!!!! :)))))
● Sentence structure cues (e.g. caps) often:
– absent
– over-used
8. How do current tools do?
● Badly!
– Out of the box: (accuracy chart on slide)
– Trained on Twitter, IRC and WSJ data: (accuracy chart on slide)
9. Where do they break?
● Continued work extending Stanford Tagger
● Terrible at tagging whole sentences correctly
– Best was 10% accuracy
– SotA on newswire about 55-60%
● Problems with unknown words – a good target
set for improving performance
– 1 in 5 words completely unseen
– 27% token accuracy on this group
10. What errors occur on unknowns?
● Gold standard errors (dank_UH je_UH → should be _FW)
● Training lacks IV words (Internet, bake)
● Pre-taggables (URLs, mentions, retweets)
● NN vs. NNP (derek_NN, Bed_NNP)
● Slang (LUVZ, HELLA, 2night)
● Genre-specific (unfollowing)
● Tokenisation errors (ass**sneezes)
● Orthographic (suprising)
11. Do we have enough data?
● No, it's even worse than normal
– Ritter: 15K tokens, PTB, one annotator
– Foster: 14K tokens, PTB, low-noise
– CMU: 39K tokens, custom, narrow tagset
12. Tweet PoS-tagging issues
● From our error analysis, three big issues were identified:
1. Many unseen words / orthographies
2. Uncertain sentence structure
3. Not enough annotated data
● Continued with Ritter dataset
13. Unseen words in tweets
● Two classes:
● Standard token, non-standard orthography:
– freinds
– KHAAAANNNNNNN!
● Non-standard token, standard orthography
– omg + bieber = omb
– Huntington
14. Unseen words in tweets
● The majority of non-standard orthographies can be
corrected with a gazetteer – a typical Pareto
distribution, where a few entries cover most cases
– vids → videos
– cussin → cursing
– hella → very
● No need to bother with e.g. Brown clustering
● 361 entries give 2.3% token error reduction
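As a sketch, the gazetteer step is just a lookup applied before tagging (the three entries below are the slide's examples; the released resource has 361):

```python
# Tiny excerpt of a normalisation gazetteer: non-standard form -> standard form.
GAZETTEER = {
    "vids": "videos",
    "cussin": "cursing",
    "hella": "very",
}

def normalise(tokens):
    """Replace known non-standard orthographies before tagging."""
    return [GAZETTEER.get(tok.lower(), tok) for tok in tokens]

print(normalise(["hella", "tired"]))  # ['very', 'tired']
```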
15. Unseen words in tweets
● The rest can be handled reasonably with word
shape and contextual features
● Using edu.stanford.nlp.tagger.maxent.ExtractorFramesRare
● Features include:
– word prefix and suffix shapes
– distribution of shape in corpus
– shapes of neighbouring words
● The corpus is small, so we adjust the rare-word threshold
● +5.35% absolute token acc., +18.5% sentence
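A simplified Python sketch of the word-shape idea (the real features come from the Stanford extractor named above; this is not its code):

```python
import re

def word_shape(token):
    """Collapse characters into classes: X = upper, x = lower, d = digit."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Collapse runs so 'KHAAAANNNNNNN!' and 'KHAAN!' share a shape.
    return re.sub(r"(.)\1+", r"\1+", shape)

def rare_word_features(tokens, i):
    """Shape and affix features for an unseen token, plus neighbour shapes."""
    tok = tokens[i]
    return {
        "shape": word_shape(tok),
        "prefix3": tok[:3], "suffix3": tok[-3:],
        "prev_shape": word_shape(tokens[i - 1]) if i > 0 else "<s>",
        "next_shape": word_shape(tokens[i + 1]) if i < len(tokens) - 1 else "</s>",
    }

print(word_shape("KHAAAANNNNNNN!"))  # X+!
```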
16. Tweet “sentence” “structure”
● They are structured (sometimes)
● We still do better if we look at global features
– Unigram tagger accuracy: 66%
● Sentence-level accuracy is important
– Unigram tagger sentence accuracy: 2.3%
17. Tweet “sentence” “structure”
● Tweets contain some constrained-form tokens
● Links, hashtags, user mentions, some smileys
● We can fix the label for these tokens
● Knowing P(c_i) constrains both P(c_{i-1} | c_i) and P(c_{i+1} | c_i) – see the sketch below
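A sketch of how such tokens can be detected and pinned before decoding. The patterns are illustrative; the tag names follow Ritter's PTB extensions (USR for mentions, HT for hashtags, URL for links):

```python
import re

# Illustrative patterns for constrained-form tokens.
FIXED_LABEL_PATTERNS = [
    (re.compile(r"^https?://\S+$"), "URL"),
    (re.compile(r"^@\w+$"), "USR"),
    (re.compile(r"^#\w+$"), "HT"),
]

def fixed_label(token):
    """Return the pinned tag for a constrained-form token, or None."""
    for pattern, tag in FIXED_LABEL_PATTERNS:
        if pattern.match(token):
            return tag
    return None

print(fixed_label("@Huddy85"))  # USR
```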
18. Tweet “sentence” “structure”
● This allows us to prune the transition graph of
labels in the sequence
● Because the graph is read in both directions,
fixing any label point impacts whole tweet
● Setting label priors reduces token error by 5.03%
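A minimal constrained Viterbi sketch with toy log-probability tables (assumptions, not learned values): pinning a label prunes every competing state at that position, and since the maximisation is over whole paths, the constraint propagates in both directions. A pinned list like the one below is what fixed_label() from the previous slide would produce.

```python
def constrained_viterbi(tokens, tags, trans, emit, pinned):
    """Viterbi decoding where pinned[i], if set, prunes all other tags at
    position i. trans[(a, b)] and emit[(tag, word)] hold log-probabilities;
    missing entries fall back to a small constant."""
    def allowed(i):
        return [pinned[i]] if pinned[i] else tags

    score = {t: emit.get((t, tokens[0]), -9.0) for t in allowed(0)}
    backpointers = []
    for i in range(1, len(tokens)):
        new_score, pointers = {}, {}
        for t in allowed(i):
            prev, best = max(((p, s + trans.get((p, t), -9.0))
                              for p, s in score.items()),
                             key=lambda pair: pair[1])
            new_score[t] = best + emit.get((t, tokens[i]), -9.0)
            pointers[t] = prev
        score = new_score
        backpointers.append(pointers)

    # Trace the single best full path backwards from the final position.
    tag = max(score, key=score.get)
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return list(reversed(path))

# Toy usage: the pinned USR tag at position 0 constrains its neighbours.
tags = ["USR", "VBZ", "NNS"]
pinned = ["USR", None, None]
trans = {("USR", "VBZ"): -0.5, ("VBZ", "NNS"): -0.5}
emit = {("VBZ", "makes"): -0.5, ("NNS", "plans"): -0.5}
print(constrained_viterbi(["@user", "makes", "plans"], tags, trans, emit, pinned))
# ['USR', 'VBZ', 'NNS']
```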
19. Not enough data
● Big unlabelled data - 75 000 000 tweets / day (en)
● Bootstrapping sometimes helps in this case
● Problem: initial accuracy is too low •︵• _UH
● Solution: consensus with > 1 tagger ◕◡◕ _UH
● Problem: only one tagger uses PTB tags ⋋〴⋌〵 _UH
● Solution: vote-constrained bootstrapping ⊙ʘ _UH
20. Vote-constrained bootstrapping
● Not many taggers available for building
semi-supervised data
● We chose Ritter's plus the CMU tagger
● Where classes don't map 1:1, create
equivalence classes between tags
– CMU tag R (adverb) → PTB (WRB,RB,RBR,RBS)
– CMU tag ! (interjection) → PTB (UH)
● Coarser tag constrains set of fine-grained tags
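A sketch of the equivalence-class idea (only the two mappings shown on the slide; the full table covers the whole CMU tagset):

```python
# Coarse CMU tags mapped to the sets of PTB tags they are compatible with.
CMU_TO_PTB = {
    "R": {"WRB", "RB", "RBR", "RBS"},   # adverbs
    "!": {"UH"},                        # interjections
}

def compatible(cmu_tag, ptb_tag):
    """A coarse CMU tag constrains, rather than determines, the PTB tag."""
    return ptb_tag in CMU_TO_PTB.get(cmu_tag, set())
```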
21. Vote-constrained bootstrapping
● Ask both taggers to label the candidate input
● Add tweet to semi-supervised data if both agree
● Lebron_^ + Lebron_NNP → OK, Lebron_NNP
● books_N + books_VBZ → Fail, reject tweet
● Evaluated quality on development set
– Agreed on 17.8% of tweets
– Of those, 97.4% of tokens correctly PTB-labelled
– 71.3% whole tweets correctly labelled
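Putting the pieces together, a sketch of the agreement filter (function names here are hypothetical, not the released tool's API):

```python
CMU_TO_PTB = {"R": {"WRB", "RB", "RBR", "RBS"}, "!": {"UH"}}  # excerpt

def agree(cmu_tagged, ptb_tagged):
    """Accept a tweet only if every PTB tag lies inside the equivalence
    class of the corresponding coarse CMU tag (and tokenisations line up)."""
    if len(cmu_tagged) != len(ptb_tagged):
        return False  # tokenisation mismatch: reject outright
    return all(pt in CMU_TO_PTB.get(ct, set())
               for (_, ct), (_, pt) in zip(cmu_tagged, ptb_tagged))

def bootstrap(tweets, cmu_tagger, ptb_tagger):
    """Yield PTB-labelled tweets on which both taggers agree; these become
    extra semi-supervised training data."""
    for tokens in tweets:
        cmu, ptb = cmu_tagger(tokens), ptb_tagger(tokens)
        if agree(cmu, ptb):
            yield ptb
```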
22. Vote-constrained bootstrapping
● Results:
– Use Trendminer lang ID + data
– Collected 1.5M agreed-upon tokens
● Adding this bootstrapped data reduced error by:
– Token-level: 13.7% Sentence-level: 4.5%
www.trendminer-project.eu
23. Final results
● Unknown-word accuracy: from 27.8% to 74.5%
                         Token   Sentence
Baseline: Ritter T-Pos   84.55    9.32
GATE: eval set           88.69   20.34
 - error reduction       26.80   12.15
GATE: dev set            90.54   28.81
 - error reduction       38.77   21.49
24. Where do we go next?
● Local tag sequence bounds?
● Better handling of hashtags
– I'm stressed at 9am, shopping on my lunch break...
can't deal w/ this today. #retailtherapy
– I'm so #bored today
● More data – bootstrapped
● More data – part-bootstrapped (e.g. CMU GS)
● More data – human annotated
● Parsing
25. Downloadable & Friendly
● As command-line tool; as GATE PR; as Stanford
Tagger model
● Included in GATE's TwitIE toolkit (4pm, Europa)
● 1.5M token dataset available
● Updates since submission:
– Better handling of contractions
– Less sensitive to tokenisation scheme
● Please play!
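For example, the Stanford-model form can be driven from Python through NLTK's Stanford wrapper (a sketch: the file names are assumptions, and a local Java install plus the tagger jar are required):

```python
from nltk.tag.stanford import StanfordPOSTagger

# File names are assumptions -- point these at your local Stanford tagger
# jar and the downloaded Twitter model.
tagger = StanfordPOSTagger(
    model_filename="gate-EN-twitter.model",
    path_to_jar="stanford-postagger.jar")

print(tagger.tag("Bonfire tonite . All are welcome , joe included".split()))
```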
26. Thank you for your time!
There is hope:
Jersey Shore is overrated. studying and
history homework then a fat night of sleep!
Do you have any questions?
27. Owoputi et al.
● NAACL'13 paper: 90.5% token accuracy with the PTB tagset
● An advancement of the Gimpel tagger, which we used for bootstrapping
● Late discovery: Can be adapted to PTB tagset with good
results
● Our techniques are disjoint from Owoputi's; combining them
could give an even better result!
● Our model is readily re-usable and integrates into existing NLP
toolsets
28. Capitalisation
● Noisy tweets have unusual capitalisation, right?
– Buy Our Widgets Now
– ugh I haet u all .. stupd ppl #fml
● Lowercase model with lowercased data allows
us to ignore capitalisation noise
● Tried multiple approaches to classifying noisy
vs. well-formed capitalisation
● The gain from ignoring case in noisy tweets is offset
by the loss from mis-classified well-cased data
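A sketch of the idea, with a deliberately crude noisiness heuristic (the threshold and test are illustrative, not the classifiers we actually evaluated):

```python
def looks_noisily_cased(tokens, threshold=0.5):
    """Crude illustrative heuristic: many capitalised words after the first
    (as in 'Buy Our Widgets Now') suggests unreliable casing."""
    words = [t for t in tokens if t.isalpha()]
    if len(words) < 2:
        return False
    capitalised = sum(1 for t in words[1:] if t[0].isupper())
    return capitalised / (len(words) - 1) > threshold

def maybe_lowercase(tokens):
    """Lowercase (for use with a lowercase-trained model) only when the
    tweet's casing looks noisy; leave well-cased tweets alone."""
    return [t.lower() for t in tokens] if looks_noisily_cased(tokens) else tokens

print(maybe_lowercase("Buy Our Widgets Now".split()))
# ['buy', 'our', 'widgets', 'now']
```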