Twitter Part-of-Speech Tagging for All:
Overcoming Sparse and Noisy Data
Leon Derczynski
Alan Ritter
Sam Clark
Kalina Bontcheva
Streaming social media is powerful
● It's Big Data!
– Velocity: 500M tweets / day
– Volume: 20M users / month
– Variety: earthquakes, stocks, this guy
● Sample of all human discourse - unprecedented
● Not only where people are & when, but also
what they are doing
● Interesting stuff - just ask the NSA!
Tweets are dirty
● You all know what Twitter is, so let's just look at
some difficult tweets
● Orthography: Kk its 22:48 friday nyt :D really
tired so imma go to sleep :) good nyt x god
bles xxxxx
● Fragments: Bonfire tonite. All are welcome,
joe included
● Capitalisation: Don't Have Time To Stop In???
Then, Check Out Our Quick Full Service Drive
Thru Window :)
● Nonverbal acts: RT @Huddy85: @Mz_Twilightxxx
*kisses your ass**sneezes after* Lol
Tough tweets: Do we even care?
● Most tweets are linguistically fairly well-formed
● RT @DesignerDepot: Minimalist Web Design: When
Less is More - http://ow.ly/2FwyX
● just went on an unfollowing spree... there's no
point of following you if you haven't tweeted
in 10+ days. #justsaying ..
● The tweets we find most difficult are those that
seem to say the least
● So im in tha chi whts popping tonight?
● i just gave my momma some money 4 a bill.... she
smiled when i put it n her hand __AND__ said "i
wanna go out to eat"... -______- HELLA SCAN
We do care
● However, there is utility in trivia:
– Sadilek: Predict if you will get flu, using spatial co-location and friend network
– Sugumaran, U. Northern Iowa. Crow corpse reports precede West Nile Virus
– Emerging events: tendency to describe briefly
''There's a dead crow
in my garden''
@mari: i think im sick ugh..
Problem representation
● Tweets into finite tokens (PTB + URLs, Smileys)
● Put tokens in categories, depending on linguistic function
● Discriminative
– cases one by one
– e.g. unigram tagger
● Sequence labelling
– order matters!
– consider neighbouring labels
● Goal: label the whole sequence correctly
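A minimal sketch of the "cases one by one" baseline named above: a unigram tagger gives each token the tag it carried most often in training and never looks at its neighbours. The training sentences and the NN fallback for unseen words are invented for illustration.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Map each lowercased word to the tag it carries most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_unigram(model, tokens, default="NN"):
    """Label tokens one by one; context is never consulted."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]


# Toy training data, invented for illustration only.
TRAIN = [
    [("i", "PRP"), ("love", "VBP"), ("books", "NNS")],
    [("she", "PRP"), ("books", "VBZ"), ("flights", "NNS")],
    [("good", "JJ"), ("books", "NNS")],
]
MODEL = train_unigram(TRAIN)
TAGGED = tag_unigram(MODEL, ["She", "books", "hotels"])
```

Here "books" is tagged NNS even though it is a verb in this sentence, which is the kind of error a sequence labeller can avoid by considering neighbouring labels.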
Word order still matters.. just
● Hard for tweets: exclamations and fragments
● Whole sequences a bit rare
● @FeeninforPretty making something to eat,
aint ate all day
● Peace green tea time!! Happyzone!!!! :)))))
● Sentence structure cues (e.g. caps) often:
– absent
– over-used
How do current tools do?
● Badly!
– Out of the box:
– Trained on Twitter,
IRC and WSJ data:
Where do they break?
● Continued work extending Stanford Tagger
● Terrible at doing whole sentences
– Best was 10% accuracy
– SotA on newswire about 55-60%
● Problems on unknown words – this is a good
target set to get better performance on
– 1 in 5 words completely unseen
– 27% token accuracy on this group
What errors occur on unknowns?
● Gold standard errors (dank_UH je_UH → _FW)
● Training lacks IV words (Internet, bake)
● Pre-taggables (URLs, mentions, retweets)
● NN vs. NNP (derek_NN, Bed_NNP)
● Slang (LUVZ, HELLA, 2night)
● Genre-specific (unfollowing)
● Tokenisation errors (ass**sneezes)
● Orthographic (suprising)
Do we have enough data?
● No, it's even worse than normal
– Ritter: 15K tokens, PTB, one annotator
– Foster: 14K tokens, PTB, low-noise
– CMU: 39K tokens, custom, narrow tagset
Tweet PoS-tagging issues
● From analysis, three big issues identified:
1. Many unseen words / orthographies
2. Uncertain sentence structure
3. Not enough annotated data
● Continued with Ritter dataset
Unseen words in tweets
● Two classes:
● Standard token, non-standard orthography;
– freinds
– KHAAAANNNNNNN!
● Non-standard token, standard orthography
– omg + bieber = omb
– Huntington
Unseen words in tweets
● Majority of non-standard orthographies can be
corrected with a gazetteer: typical Pareto
– vids → videos
– cussin → cursing
– hella → very
● No need to bother with e.g. Brown clustering
● 361 entries give 2.3% token error reduction
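A gazetteer of this kind can be sketched as a plain lookup table. The three entries below are the ones on the slide; the real list has 361 entries, and the function name and case-folding choice are assumptions.

```python
# Three entries taken from the slide; the full gazetteer has 361.
GAZETTEER = {"vids": "videos", "cussin": "cursing", "hella": "very"}


def normalise(tokens, gazetteer=GAZETTEER):
    """Replace known non-standard spellings; leave every other token untouched."""
    return [gazetteer.get(tok.lower(), tok) for tok in tokens]


NORMALISED = normalise(["hella", "good", "vids"])
```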
Unseen words in tweets
● The rest can be handled reasonably with word
shape and contextual features
● Using edu.stanford.nlp.tagger.maxent.ExtractorFramesRare
● Features include:
– word prefix and suffix shapes
– distribution of shape in corpus
– shapes of neighbouring words
● Corpus small, so adjust rare threshold
● +5.35% absolute token acc., +18.5% sentence
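To make "word shape" concrete, here is an illustrative sketch. It is not the ExtractorFramesRare implementation: the shape alphabet (X, x, d), the run-collapsing step, the affix length and the feature names are all assumptions.

```python
import re


def word_shape(token):
    """Collapse a token to its character-class pattern, e.g. 'Huntington' -> 'Xx'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Collapse repeated classes so elongated words share a shape.
    return re.sub(r"(.)\1+", r"\1", shape)


def shape_features(tokens, i, affix_len=3):
    """Shape of the token, of its prefix and suffix, and of its neighbours."""
    tok = tokens[i]
    return {
        "shape": word_shape(tok),
        "prefix_shape": word_shape(tok[:affix_len]),
        "suffix_shape": word_shape(tok[-affix_len:]),
        "prev_shape": word_shape(tokens[i - 1]) if i > 0 else "<s>",
        "next_shape": word_shape(tokens[i + 1]) if i + 1 < len(tokens) else "</s>",
    }


FEATURES = shape_features(["Bonfire", "tonite", "."], 1)
```

With this scheme an unseen token such as "KHAAAANNNNNNN!" gets the shape "X!" and "2night" gets "dx", so the tagger can generalise from other tokens with the same shape.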
Tweet “sentence” “structure”
● They are structured (sometimes)
● We still do better if we look at global features
– Unigram tagger accuracy: 66%
● Sentence-level accuracy is important
– Unigram tagger sentence accuracy: 2.3%
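The two metrics differ only in what counts as a hit: a token, or a whole tweet with every token right. A small sketch, with invented gold and predicted tag sequences:

```python
def token_and_sentence_accuracy(gold, predicted):
    """Return (token accuracy, whole-sentence accuracy) for parallel tag sequences."""
    token_hits = token_total = sentence_hits = 0
    for gold_tags, pred_tags in zip(gold, predicted):
        matches = [g == p for g, p in zip(gold_tags, pred_tags)]
        token_hits += sum(matches)
        token_total += len(gold_tags)
        # A sentence counts only if every one of its tokens is correct.
        sentence_hits += int(all(matches) and len(gold_tags) == len(pred_tags))
    return token_hits / token_total, sentence_hits / len(gold)


# Toy data, invented for illustration only.
GOLD = [["PRP", "VBP", "NN"], ["UH", "NNP"]]
PRED = [["PRP", "VBP", "NNS"], ["UH", "NNP"]]
TOKEN_ACC, SENTENCE_ACC = token_and_sentence_accuracy(GOLD, PRED)
```

One wrong token out of five leaves token accuracy at 0.8 but halves sentence accuracy to 0.5, which is why sentence-level figures are so much lower.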
Tweet “sentence” “structure”
● Tweets contain some constrained-form tokens
● Links, hashtags, user mentions, some smileys
● We can fix the label for these tokens
● Knowing P(c_i) constrains both P(c_{i-1}|c_i) and P(c_{i+1}|c_i)
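Constrained-form tokens can be recognised with a few patterns before tagging. The sketch below is an assumption-laden illustration: the regular expressions are simplified, and the tag names URL, USR, HT and RT are taken to be the Twitter-specific labels of the Ritter tagset.

```python
import re

# Simplified patterns; the tag names are assumed to match the Ritter tagset.
FIXED_PATTERNS = [
    (re.compile(r"^https?://\S+$", re.IGNORECASE), "URL"),
    (re.compile(r"^@\w+:?$"), "USR"),
    (re.compile(r"^#\w+$"), "HT"),
    (re.compile(r"^RT$"), "RT"),
]


def fixed_label(token):
    """Return the label for a constrained-form token, or None if it is free."""
    for pattern, tag in FIXED_PATTERNS:
        if pattern.match(token):
            return tag
    return None


TWEET = "RT @DesignerDepot: Minimalist Web Design http://ow.ly/2FwyX #justsaying"
LABELS = [fixed_label(tok) for tok in TWEET.split()]
```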
Tweet “sentence” “structure”
● This allows us to prune the transition graph of
labels in the sequence
● Because the graph is read in both directions,
fixing any label point impacts whole tweet
● Setting label priors reduces token error 5.03%
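The effect of fixing one label on its neighbours can be shown with a toy first-order Viterbi decoder. This is a simplified stand-in, not the Stanford Tagger's inference procedure, and every score below is invented.

```python
import math


def viterbi(n_tokens, tags, emit, trans, fixed=None):
    """Best tag path under log-scores; `fixed` pins positions to a single tag."""
    fixed = fixed or {}
    allowed = [[fixed[i]] if i in fixed else list(tags) for i in range(n_tokens)]
    best = [{t: emit[0].get(t, -math.inf) for t in allowed[0]}]
    back = []
    for i in range(1, n_tokens):
        scores, pointers = {}, {}
        for t in allowed[i]:
            prev = max(best[-1], key=lambda p: best[-1][p] + trans.get((p, t), -math.inf))
            scores[t] = best[-1][prev] + trans.get((prev, t), -math.inf) + emit[i].get(t, -math.inf)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    path = [max(best[-1], key=best[-1].get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Invented log-scores for the two tokens "follow @mari".
TAGS = ["NN", "VB", "USR"]
EMIT = [{"NN": -0.6, "VB": -0.8}, {"NN": -1.0, "VB": -1.2, "USR": -1.5}]
TRANS = {("NN", "NN"): -0.7, ("NN", "VB"): -1.0, ("NN", "USR"): -2.0,
         ("VB", "NN"): -0.9, ("VB", "VB"): -2.0, ("VB", "USR"): -0.2}

FREE = viterbi(2, TAGS, EMIT, TRANS)
PINNED = viterbi(2, TAGS, EMIT, TRANS, fixed={1: "USR"})
```

Left free, the decoder picks NN NN. Pinning the mention to USR prunes every path through the other labels at that position, and the preceding token flips from NN to VB.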
Not enough data
● Big unlabelled data - 75 000 000 tweets / day (en)
● Bootstrapping sometimes helps in this case
● Problem: initial accuracy is too low ● •︵ _UH
● Solution: consensus with > 1 tagger ◕ ◡ ◕ _UH
● Problem: only one tagger using PTB tags ⋋〴 _⋌ 〵 _UH
● Solution: Vote-constrained Bootstrapping _⊙ ʘ _UH
Vote-constrained bootstrapping
● Not many taggers available for building
semi-supervised data
● We chose Ritter's plus the CMU tagger
● Where classes don't map 1:1, create equivalence
classes between tags
– CMU tag R (adverb) → PTB (WRB,RB,RBR,RBS)
– CMU tag !(interjection) → PTB (UH)
● Coarser tag constrains set of fine-grained tags
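A sketch of how a coarse tag can constrain the fine-grained choice. The R and ! rows come from the slide; the ^ and N rows, the function name and the scores are assumptions made for illustration.

```python
# CMU tag -> set of compatible PTB tags.
CMU_TO_PTB = {
    "R": {"WRB", "RB", "RBR", "RBS"},  # adverb (from the slide)
    "!": {"UH"},                        # interjection (from the slide)
    "^": {"NNP", "NNPS"},               # proper noun (assumed mapping)
    "N": {"NN", "NNS"},                 # common noun (assumed mapping)
}


def constrain(ptb_scores, cmu_tag, mapping=CMU_TO_PTB):
    """Pick the best-scoring PTB tag among those the coarse CMU tag allows."""
    allowed = mapping.get(cmu_tag)
    if allowed is None:
        # Unmapped coarse tag: no constraint applies.
        return max(ptb_scores, key=ptb_scores.get)
    candidates = {t: s for t, s in ptb_scores.items() if t in allowed}
    return max(candidates, key=candidates.get) if candidates else None


# Invented scores: JJ wins unconstrained, but CMU says adverb.
CHOSEN = constrain({"RB": 0.4, "JJ": 0.5, "RBR": 0.1}, "R")
```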
Vote-constrained bootstrapping
● Ask both taggers to label the candidate input
● Add tweet to semi-supervised data if both agree
● Lebron_^ + Lebron_NNP → OK, Lebron_NNP
● books_N + books_VBZ → Fail, reject tweet
● Evaluated quality on development set
– Agreed on 17.8% of tweets
– Of those, 97.4% of tokens correctly PTB labelled
– 71.3% whole tweets correctly labelled
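The agreement test can be sketched as below, using the two examples from the slide. The function name and the ^ and N mappings are assumptions; a tweet is kept only if every token's PTB tag falls inside the class allowed by its CMU tag.

```python
# Assumed equivalence classes for the two CMU tags used in the examples.
MAPPING = {"^": {"NNP", "NNPS"}, "N": {"NN", "NNS"}}


def vote(cmu_tagged, ptb_tagged, mapping=MAPPING):
    """Return the PTB-tagged tweet if both taggers agree on every token, else None."""
    if len(cmu_tagged) != len(ptb_tagged):
        return None
    for (cmu_tok, cmu_tag), (ptb_tok, ptb_tag) in zip(cmu_tagged, ptb_tagged):
        if cmu_tok != ptb_tok or ptb_tag not in mapping.get(cmu_tag, ()):
            return None  # one disagreement rejects the whole tweet
    return list(ptb_tagged)


ACCEPTED = vote([("Lebron", "^")], [("Lebron", "NNP")])
REJECTED = vote([("books", "N")], [("books", "VBZ")])
```

Rejecting whole tweets rather than single tokens keeps only fully agreed sequences in the bootstrapped data.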
Vote-constrained bootstrapping
● Results:
– Use Trendminer lang ID + data
– Collected 1.5M agreed-upon tokens
● Adding this bootstrapped data reduced error by:
– Token-level: 13.7% Sentence-level: 4.5%
www.trendminer-project.eu
Final results
● Unknown accuracy rate: from 27.8% to 74.5%
                          Token   Sentence
Baseline: Ritter T-Pos    84.55       9.32
GATE: eval set            88.69      20.34
- error reduction         26.80      12.15
GATE: dev set             90.54      28.81
- error reduction         38.77      21.49
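The error-reduction rows follow from the accuracy rows: the gain in accuracy divided by the baseline error. A short check of that arithmetic, with a hypothetical helper name:

```python
def error_reduction(baseline_acc, new_acc):
    """Relative error reduction in percent, given accuracies in percent."""
    baseline_error = 100.0 - baseline_acc
    return 100.0 * (new_acc - baseline_acc) / baseline_error


EVAL_TOKEN = error_reduction(84.55, 88.69)    # eval set, token level
EVAL_SENTENCE = error_reduction(9.32, 20.34)  # eval set, sentence level
DEV_TOKEN = error_reduction(84.55, 90.54)     # dev set, token level
DEV_SENTENCE = error_reduction(9.32, 28.81)   # dev set, sentence level
```

For example, token error on the eval set falls from 15.45 to 11.31 points, and 4.14 / 15.45 matches the reported 26.80% reduction.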
Where do we go next?
● Local tag sequence bounds?
● Better handling of hashtags
– I'm stressed at 9am, shopping on my lunch break...
can't deal w/ this today. #retailtherapy
– I'm so #bored today
● More data – bootstrapped
● More data – part-bootstrapped (e.g. CMU GS)
● More data – human annotated
● Parsing
Downloadable & Friendly
● As command-line tool; as GATE PR; as Stanford
Tagger model
● Included in GATE's TwitIE toolkit (4pm, Europa)
● 1.5M token dataset available
● Updates since submission:
– Better handling of contractions
– Less sensitive to tokenisation scheme
● Please play!
Thank you for your time!
There is hope:
Jersey Shore is overrated. studying and
history homework then a fat night of sleep!
Do you have any questions?
Owoputi et al.
● NAACL'13 paper: 90.5% token perf w/ PTB accuracy
● Advancement of the Gimpel tagger, used for our bootstrapping
● Late discovery: Can be adapted to PTB tagset with good
results
● We use disjoint techniques to Owoputi; combining them could
give an even better result!
● Our model readily re-usable and integrated into existing NLP
tool sets
Capitalisation
● Noisy tweets have unusual capitalisation, right?
– Buy Our Widgets Now
– ugh I haet u all .. stupd ppl #fml
● Lowercase model with lowercased data allows
us to ignore capitalisation noise
● Tried multiple approaches to classifying noisy
vs. well-formed capitalisation
● Gain from ignoring case in noisy tweets offset
by loss from mis-classified well-cased data

 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data

  • 1. Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data Leon Derczynski Alan Ritter Sam Clark Kalina Bontcheva
  • 2. Streaming social media is powerful ● It's Big Data! – Velocity: 500M tweets / day – Volume: 20M users / month – Variety: earthquakes, stocks, this guy ● Sample of all human discourse - unprecedented ● Not only where people are & when, but also what they are doing ● Interesting stuff - just ask the NSA!
  • 3. Tweets are dirty ● You all know what Twitter is, so let's just look at some difficult tweets ● Orthography: Kk its 22:48 friday nyt :D really tired so imma go to sleep :) good nyt x god bles xxxxx ● Fragments: Bonfire tonite. All are welcome, joe included ● Capitalisation: Don't Have Time To Stop In??? Then, Check Out Our Quick Full Service Drive Thru Window :) ● Nonverbal acts: RT @Huddy85: @Mz_Twilightxxx *kisses your ass**sneezes after* Lol
  • 4. Tough tweets: Do we even care? ● Most tweets are linguistically fairly well-formed ● RT @DesignerDepot: Minimalist Web Design: When Less is More - http://ow.ly/2FwyX ● just went on an unfollowing spree... there's no point of following you if you haven't tweeted in 10+ days. #justsaying .. ● The tweets we find most difficult, are those that seem to say the least ● So im in tha chi whts popping tonight? ● i just gave my momma some money 4 a bill.... she smiled when i put it n her hand __AND__ said "i wanna go out to eat"... -______- HELLA SCAN
  • 5. We do care ● However, there is utility in trivia: – Sadilek: Predict if you will get flu, using spatial co-location and friend network – Sugumaran, U. Northern Iowa. Crow corpse reports precede West Nile Virus – Emerging events: tendency to describe briefly ''There's a dead crow in my garden'' @mari: i think im sick ugh..
  • 6. Problem representation ● Tweets into finite tokens (PTB + URLs, Smileys) ● Put tokens in categories, depending on linguistic function ● Discriminative – cases one by one – e.g. unigram tagger ● Sequence labelling – order matters! – consider neighbouring labels ● Goal: label the whole sequence correctly
  • 7. Word order still matters.. just ● Hard for tweets: exclamations and fragments ● Whole sequences a bit rare ● @FeeninforPretty making something to eat, aint ate all day ● Peace green tea time!! Happyzone!!!! :))))) ● Sentence structure cues (e.g. caps) often: – absent – over-used
  • 8. How do current tools do? ● Badly! – Out of the box: – Trained on Twitter, IRC and WSJ data:
  • 9. Where do they break? ● Continued work extending Stanford Tagger ● Terrible at doing whole sentences – Best was 10% accuracy – SotA on newswire about 55-60% ● Problems on unknown words – this is a good target set to get better performance on – 1 in 5 words completely unseen – 27% token accuracy on this group
  • 10. What errors occur on unknowns? ● Gold standard errors (dank_UH je_UH → _FW) ● Training lacks IV words (Internet, bake) ● Pre-taggables (URLs, mentions, retweets) ● NN vs. NNP (derek_NN, Bed_NNP) ● Slang (LUVZ, HELLA, 2night) ● Genre-specific (unfollowing) ● Tokenisation errors (ass**sneezes) ● Orthographic (suprising)
  • 11. Do we have enough data? ● No, it's even worse than normal – Ritter: 15K tokens, PTB, one annotator – Foster: 14K tokens, PTB, low-noise – CMU: 39K tokens, custom, narrow tagset
  • 12. Tweet PoS-tagging issues ● From analysis, three big issues identified: 1. Many unseen words / orthographies 2. Uncertain sentence structure 3. Not enough annotated data ● Continued with Ritter dataset
  • 13. Unseen words in tweets ● Two classes: ● Standard token, non-standard orthography; – freinds – KHAAAANNNNNNN! ● Non-standard token, standard orthography – omg + bieber = omb – Huntington
  • 14. Unseen words in tweets ● Majority of non-standard orthographies can be corrected with a gazetteer: typical Pareto – vids → videos – cussin → cursing – hella → very ● No need to bother with e.g. Brown clustering ● 361 entries give 2.3% token error reduction
  • 15. Unseen words in tweets ● The rest can be handled reasonably with word shape and contextual features ● Using edu.stanford.nlp.tagger.maxent.ExtractorFramesRare ● Features include: – word prefix and suffix shapes – distribution of shape in corpus – shapes of neighbouring words ● Corpus small, so adjust rare threshold ● +5.35% absolute token acc., +18.5% sentence
  • 16. Tweet “sentence” “structure” ● They are structured (sometimes) ● We still do better if we look at global features – Unigram tagger accuracy: 66% ● Sentence-level accuracy is important – Unigram tagger sentence accuracy: 2.3%
  • 17. Tweet “sentence” “structure” ● Tweets contain some constrained-form tokens ● Links, hashtags, user mentions, some smileys ● We can fix the label for these tokens ● Knowing P(ci) constrains both P(ci-1|ci) and P(ci+1|ci)
  • 18. Tweet “sentence” “structure” ● This allows us to prune the transition graph of labels in the sequence ● Because the graph is read in both directions, fixing any label point impacts whole tweet ● Setting label priors reduces token error 5.03%
  • 19. Not enough data ● Big unlabelled data - 75 000 000 tweets / day (en) ● Bootstrapping sometimes helps in this case ● Problem: initial accuracy is too low ● Solution: consensus with > 1 tagger ● Problem: only one tagger using PTB tags ● Solution: Vote-constrained Bootstrapping ● Emoticon examples: •︵ _UH, ◕ ◡ ◕ _UH, ⋋〴 _⋌ 〵 _UH, _⊙ ʘ _UH
  • 20. Vote-constrained bootstrapping ● Not many taggers available for building semi-supervised data ● We chose Ritter's plus the CMU tagger ● Where classes don't map 1:1, create equivalence classes between tags – CMU tag R (adverb) → PTB (WRB, RB, RBR, RBS) – CMU tag ! (interjection) → PTB (UH) ● Coarser tag constrains set of fine-grained tags
  • 21. Vote-constrained bootstrapping ● Ask both taggers to label the candidate input ● Add tweet to semi-supervised data if both agree ● Lebron_^ + Lebron_NNP → OK, Lebron_NNP ● books_N + books_VBZ → Fail, reject tweet ● Evaluated quality on development set – Agreed on 17.8% of tweets – Of those, 97.4% of tokens correctly PTB labelled – 71.3% whole tweets correctly labelled
  • 22. Vote-constrained bootstrapping ● Results: – Use Trendminer lang ID + data – Collected 1.5M agreed-upon tokens ● Adding this bootstrapped data reduced error by: – Token-level: 13.7% Sentence-level: 4.5% www.trendminer-project.eu
  • 23. Final results ● Unknown accuracy rate: from 27.8% to 74.5%
        System                    Token (%)   Sentence (%)
        Baseline: Ritter T-Pos    84.55       9.32
        GATE: eval set            88.69       20.34
          - error reduction       26.80       12.15
        GATE: dev set             90.54       28.81
          - error reduction       38.77       21.49
  • 24. Where do we go next? ● Local tag sequence bounds? ● Better handling of hashtags – I'm stressed at 9am, shopping on my lunch break... can't deal w/ this today. #retailtherapy – I'm so #bored today ● More data – bootstrapped ● More data – part-bootstrapped (e.g. CMU GS) ● More data – human annotated ● Parsing
  • 25. Downloadable & Friendly ● As command-line tool; as GATE PR; as Stanford Tagger model ● Included in GATE's TwitIE toolkit (4pm, Europa) ● 1.5M token dataset available ● Updates since submission: – Better handling of contractions – Less sensitive to tokenisation scheme ● Please play!
  • 26. Thank you for your time! There is hope: Jersey Shore is overrated. studying and history homework then a fat night of sleep! Do you have any questions?
  • 27. Owoputi et al. ● NAACL'13 paper: 90.5% token perf w/ PTB accuracy ● Advancement of the Gimpel tagger, used for our bootstrapping ● Late discovery: Can be adapted to PTB tagset with good results ● We use disjoint techniques to Owoputi; combining them could give an even better result! ● Our model readily re-usable and integrated into existing NLP tool sets
  • 28. Capitalisation ● Noisy tweets have unusual capitalisation, right? – Buy Our Widgets Now – ugh I haet u all .. stupd ppl #fml ● Lowercase model with lowercased data allows us to ignore capitalisation noise ● Tried multiple approaches to classifying noisy vs. well-formed capitalisation ● Gain from ignoring case in noisy tweets offset by loss from mis-classified well-cased data
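
The gazetteer lookup from slide 14 can be sketched roughly as below. This is an illustration only, not the authors' implementation: the three entries are the ones shown on the slide (the real list has 361), and the names `GAZETTEER` and `normalise` are hypothetical.

```python
# Hypothetical gazetteer holding only the three mappings shown on slide 14.
GAZETTEER = {
    "vids": "videos",
    "cussin": "cursing",
    "hella": "very",
}


def normalise(tokens):
    """Replace known non-standard spellings; leave every other token untouched."""
    # Lookup is case-insensitive (an assumption), so "HELLA" also maps to "very".
    return [GAZETTEER.get(tok.lower(), tok) for tok in tokens]


print(normalise(["HELLA", "good", "vids"]))
```

Tokens missing from the gazetteer pass through unchanged, so the lookup can run ahead of any tagger without further changes.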
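
Slide 17's idea of fixing the label for constrained-form tokens might look something like the sketch below. The regular expressions and the tag names `URL`, `USR`, `HT` and `RT` are assumptions made for illustration; the slides only say that links, hashtags, user mentions and some smileys can be pre-labelled.

```python
import re

# Assumed patterns and tag names (hypothetical; adapt to the tagset in use).
FIXED_PATTERNS = [
    (re.compile(r"^https?://\S+$"), "URL"),  # links
    (re.compile(r"^@\w+$"), "USR"),          # user mentions
    (re.compile(r"^#\w+$"), "HT"),           # hashtags
    (re.compile(r"^RT$"), "RT"),             # retweet marker
]


def fixed_label(token):
    """Return a fixed tag for a constrained-form token, or None if unconstrained."""
    for pattern, tag in FIXED_PATTERNS:
        if pattern.match(token):
            return tag
    return None


def label_priors(tokens):
    """One entry per token: a fixed tag, or None where the tagger must decide."""
    return [fixed_label(tok) for tok in tokens]


print(label_priors(["RT", "@DesignerDepot", ":", "http://ow.ly/2FwyX"]))
```

A sequence tagger would then treat each non-None entry as a known label and prune the transitions around it. Slide 24 notes that hashtags such as #bored need better handling, so a blanket hashtag tag is a simplification.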
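
The agreement check from slides 20 and 21 can be sketched as follows, assuming both taggers saw the same tokenisation. The `R` and `!` classes come from slide 20; the `^` and `N` classes are guesses based on the slide 21 examples, and `vote` is a hypothetical name. This is not the authors' code.

```python
# Coarse CMU tag -> set of compatible fine-grained PTB tags.
CMU_TO_PTB = {
    "R": {"WRB", "RB", "RBR", "RBS"},  # from slide 20
    "!": {"UH"},                       # from slide 20
    "^": {"NNP", "NNPS"},              # assumption (proper nouns)
    "N": {"NN", "NNS"},                # assumption (common nouns)
}


def vote(cmu_tags, ptb_tags):
    """Return the PTB tags if the taggers agree on every token, else None."""
    if len(cmu_tags) != len(ptb_tags):
        return None  # tokenisations differ, so the tweet cannot be compared
    for cmu, ptb in zip(cmu_tags, ptb_tags):
        # An unmapped CMU tag counts as disagreement (a conservative choice).
        if ptb not in CMU_TO_PTB.get(cmu, set()):
            return None
    return list(ptb_tags)


print(vote(["^"], ["NNP"]))  # Lebron_^ + Lebron_NNP: accepted
print(vote(["N"], ["VBZ"]))  # books_N + books_VBZ: rejected
```

A single disagreeing token rejects the whole tweet, which matches the "Fail, reject tweet" behaviour on slide 21.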
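
The "error reduction" rows on slide 23 appear to be relative reductions in error rate against the Ritter T-Pos baseline. The sketch below reproduces them under that assumption; `error_reduction` is a hypothetical helper name.

```python
def error_reduction(baseline_acc, new_acc):
    """Relative error reduction in percent, given two accuracies in percent."""
    base_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (base_err - new_err) / base_err


# Token-level, eval set: 84.55 -> 88.69
print(round(error_reduction(84.55, 88.69), 2))
# Sentence-level, eval set: 9.32 -> 20.34
print(round(error_reduction(9.32, 20.34), 2))
```

Plugging in the dev-set figures (90.54 token, 28.81 sentence) against the same baseline gives the 38.77 and 21.49 rows.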