Mining Social Media with Linked Open Data,
Entity Recognition and Event Extraction
Leon Derczynski
Kalina Bontcheva
Third Workshop on Data Extraction and Object Search,
Oxford,
7 July 2013
Social Media = Big Data
Gartner "3V" definition:
1. Volume
2. Velocity
3. Variety
High volume & velocity of messages:
Twitter has ~20 000 000 users per month
They write ~500 000 000 messages per day
Massive variety:
Stock markets;
Earthquakes;
Social arrangements;
… Bieber
What resources do we have now?
Large, content-rich, connected, digital streams of human discourse
We transfer knowledge via communication
Sampling communication gives a sample of human knowledge
"You've only done that which you can communicate"
The metadata (time – place – imagery) gives a richer resource:
→ A sampling of human behaviour
Entity annotation components
Named entity recognition
dbpedia.org/resource/.....
Michael_Jackson
Michael_Jackson_(writer)
Linking entities
Named Entity Recognition
Goal is to find entities we might like to link
General accuracy on newswire: 89% F1
General accuracy on microblogs: 41% F1
L. Derczynski, D. Maynard, N. Aswani, K. Bontcheva. "Microblog-Genre Noise and Impact on Semantic Annotation Accuracy." 24th ACM Conference on Hypertext and Social Media. 2013
Microblog:
Gotta dress up for london fashion week and party in style!!!
Newswire:
London Fashion Week grows up – but mustn't take itself too seriously. Once a launching pad for new designers, it is fast becoming the main event. But LFW mustn't let the luxury and money crush its sense of silliness.
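The gap between the two genres can be illustrated with a minimal sketch. It assumes a deliberately crude capitalisation heuristic as a stand-in for NER; this is not the system evaluated in the talk, and the function name `capitalised_spans` is hypothetical.

```python
import re

def capitalised_spans(text):
    """Return runs of consecutive capitalised words (a crude NER stand-in)."""
    # One or more capitalised tokens separated by single spaces.
    pattern = r"\b[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*\b"
    return re.findall(pattern, text)

# Shortened versions of the two example texts from the slide.
newswire = "London Fashion Week grows up. But LFW mustn't let the luxury crush it."
microblog = "Gotta dress up for london fashion week and party in style!!!"

print(capitalised_spans(newswire))
print(capitalised_spans(microblog))
```

On the newswire text the heuristic picks up "London Fashion Week" (along with some noise), while on the lowercased tweet it finds no span covering the entity at all, which is one reason capitalisation-dependent features tend to transfer poorly to microblogs.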
NER difficulties
Rule-based systems get the bulk of entities (newswire 77% F1)
ML-based systems do well at the remainder (newswire 89% F1)
Small proportion of difficult entities
Many complex issues
Using improved pipeline:
ML struggles, even with in-genre data: 49% F1
Rules cut through microblog noise: 80% F1
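For reference, F1 is the harmonic mean of precision and recall. The sketch below shows the arithmetic only; the counts are hypothetical and are not taken from the experiments reported on the slides.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 is the harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, chosen only to illustrate the calculation.
print(round(f1_score(80, 20, 20), 2))
```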
Word-level linking performance
Dataset: Ritter NER + DBpedia URIs
Detect mentions of entity in tweets
Crowdsourced annotations
Expert gold standard
Discard after disagreement or ambiguity
We disambiguate mentions to DBpedia / Wikipedia (easy to map)
General performance: F1 81%
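The "discard after disagreement" step could look roughly like the sketch below. It assumes each mention carries a list of URIs chosen by different annotators; the data layout, the `keep_agreed` name and the agreement threshold are illustrative assumptions, not the procedure actually used to build the dataset.

```python
from collections import Counter

def keep_agreed(annotations, min_agreement=1.0):
    """Keep mentions whose annotators agree on one URI; drop the rest."""
    gold = {}
    for mention, uris in annotations.items():
        if not uris:
            continue  # nothing annotated for this mention
        uri, votes = Counter(uris).most_common(1)[0]
        if votes / len(uris) >= min_agreement:
            gold[mention] = uri
    return gold

# Toy crowd judgements (hypothetical).
crowd = {
    "Paris": ["dbpedia:Paris", "dbpedia:Paris", "dbpedia:Paris"],
    "Jackson": [
        "dbpedia:Michael_Jackson",
        "dbpedia:Michael_Jackson_(writer)",
        "dbpedia:Michael_Jackson",
    ],
}
print(keep_agreed(crowd))
```

With the default threshold only unanimous mentions survive; lowering `min_agreement` trades gold-standard purity for coverage.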
Word-level linking issues
Automatic annotation:
Branching out from Lincoln park(LOC) after dark ... Hello "Russian
Navy(ORG)", it's like the same thing but with glitter!
Actual:
Branching out from Lincoln park after dark(PROD) ... Hello
"Russian Navy(PROD)", it's like the same thing but with glitter!
Clue in unusual collocations
+ ?
LODIE: LOD-based Information Extraction
Uses DBpedia as reference knowledge graph
Why DBpedia?
Regularly updated (from Wikipedia)
Good source for named entities
A hierarchy of concepts
A capital is also a city, but not vice versa
Relations between concepts
Paris locatedIn France
ParisHilton bornIn NewYorkCity
Demo: http://demos.gate.ac.uk/trendminer/obie/
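The hierarchy and relation examples above can be sketched with a tiny in-memory triple set. The class names, the `type` predicate and the helper functions are simplified assumptions for illustration; they are not the DBpedia ontology or the LODIE data model.

```python
# Hypothetical mini-hierarchy: every Capital is a City, every City is a Place.
SUBCLASS_OF = {"Capital": "City", "City": "Place"}

# Toy triples in (subject, predicate, object) form.
TRIPLES = {
    ("Paris", "type", "Capital"),
    ("Paris", "locatedIn", "France"),
    ("ParisHilton", "bornIn", "NewYorkCity"),
    ("NewYorkCity", "type", "City"),
}

def is_a(concept, ancestor):
    """Walk up the hierarchy: True if concept equals or descends from ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False

def instances_of(cls):
    """All subjects typed with cls or with any of its subclasses."""
    return sorted(s for s, p, o in TRIPLES if p == "type" and is_a(o, cls))

print(instances_of("City"))     # capitals are included
print(instances_of("Capital"))  # plain cities are not
```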
LODIE: LOD-based Information Extraction
We increase recall by:
Deriving abbreviations from link anchor texts in Wikipedia
"She was born in <a href="New_York_(city)">NYC</a>"
Rank boosting terms using redirect pages
Matching NE candidates using wildcard queries (e.g. Burton upon Trent and Burton-on-Trent)
This makes disambiguation harder (precision)
Use naive string, latent semantic, and contextual similarity metrics + URI commonness to disambiguate
This is what achieved our good results!
Demo: http://demos.gate.ac.uk/trendminer/obie/
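The recall and precision steps above can be sketched as follows. This is a simplified illustration under stated assumptions: aliases come from a regex over toy HTML snippets, wildcard matching uses Python's `fnmatch`, and disambiguation combines only URI commonness with a bag-of-words context overlap (the latent semantic and string similarity metrics mentioned on the slide are left out). All function names, data and the `weight` parameter are hypothetical.

```python
import fnmatch
import re
from collections import Counter, defaultdict

ANCHOR = re.compile(r'<a href="([^"]+)">([^<]+)</a>')

def build_alias_counts(pages):
    """Count how often each anchor text links to each target page."""
    counts = defaultdict(Counter)
    for html in pages:
        for target, text in ANCHOR.findall(html):
            counts[text.lower()][target] += 1
    return counts

def wildcard_match(pattern, names):
    """Case-insensitive wildcard lookup over candidate names."""
    return [n for n in names if fnmatch.fnmatchcase(n.lower(), pattern.lower())]

def disambiguate(mention, context, alias_counts, descriptions, weight=0.5):
    """Pick the candidate with the best mix of commonness and context overlap."""
    candidates = alias_counts.get(mention.lower(), {})
    total = sum(candidates.values())
    context_words = set(context.lower().split())
    best, best_score = None, -1.0
    for uri, count in candidates.items():
        commonness = count / total  # share of links with this anchor text
        desc_words = set(descriptions.get(uri, "").lower().split())
        overlap = len(context_words & desc_words) / (len(context_words) or 1)
        score = weight * commonness + (1 - weight) * overlap
        if score > best_score:
            best, best_score = uri, score
    return best

# Toy data (hypothetical).
pages = [
    'She was born in <a href="New_York_(city)">NYC</a>.',
    'The <a href="New_York_(city)">NYC</a> subway.',
    '<a href="New_York_City_FC">NYC</a> won again.',
]
descriptions = {
    "New_York_(city)": "most populous city in united states",
    "New_York_City_FC": "professional football club match league",
}
aliases = build_alias_counts(pages)
print(wildcard_match("Burton*Trent", ["Burton upon Trent", "Burton-on-Trent", "Burton"]))
print(disambiguate("NYC", "flying to NYC tomorrow", aliases, descriptions))
print(disambiguate("NYC", "NYC won the football match", aliases, descriptions))
```

Without useful context the more common target wins; when the context overlaps with a candidate's description, that evidence can outweigh commonness.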
Social media contains events
How are events differently described in social media and news?
Conventional docs (e.g. newswire) have contextual info
Central event in distinct document segment (e.g. headline)
Location
Actors / participants
Causes
Outcomes
Similar prior events
This kind of description is not found in social media
No editing guidelines
Often limited message length
Instead, event facets are represented sparsely
Only 1-2 facets per message about the event
Event extraction
Social media streams are punctuated with descriptions of events
… Accompanied by event facets
"Obama is visiting Russia"
"The US president has not visited Putin before"
Many viewpoints on the same temporal entity (like triples)
How can we extract these?
We use the TimeML definitions of events in text:
Minimal lexicalisation (i.e. annotate one word)
Event classes: we focus on ACTIONs and OCCURRENCEs
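Minimal lexicalisation means that only the single head word of an event is marked. A small sketch of what such an annotation could look like, assuming a TimeML-style inline `EVENT` tag with `eid` and `class` attributes; the helper `tag_event` is hypothetical and the class set is restricted to the two classes named above.

```python
from xml.sax.saxutils import escape

EVENT_CLASSES = {"ACTION", "OCCURRENCE"}  # the two classes named on the slide

def tag_event(tokens, head_index, event_class, eid):
    """Wrap exactly one token (the event head) in a TimeML-style EVENT tag."""
    if event_class not in EVENT_CLASSES:
        raise ValueError(f"unsupported event class: {event_class}")
    out = []
    for i, token in enumerate(tokens):
        text = escape(token)  # keep the output well-formed
        if i == head_index:
            text = f'<EVENT eid="{eid}" class="{event_class}">{text}</EVENT>'
        out.append(text)
    return " ".join(out)

print(tag_event("Obama is visiting Russia".split(), 2, "OCCURRENCE", "e1"))
```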
Event extraction
How can we extract event mentions?
Conventional approaches are hybrid:
Statistical learning
Syntactic structures
Existing TimeML resources
TimeBank corpus (newswire)
Evita event extraction tool
Adapting to social media text
Negatively impacted by problems with NER
Short sentence structure
→ Use shallow linguistic techniques and fuzzy matches
Evita: F1 80.1
TIPSem: F1 81.4 (on well-formed text)
USFD Arcomem: F1 81.1 (noise-resilient)
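One way to read "shallow linguistic techniques and fuzzy matches" is approximate lookup of noisy tokens against a list of known event words. The sketch below uses `difflib.get_close_matches` from the Python standard library; the trigger lexicon and the cutoff are hypothetical, and this is not the USFD Arcomem implementation.

```python
import difflib

# Hypothetical trigger lexicon; a real one might be derived from TimeBank.
EVENT_TRIGGERS = ["visit", "visiting", "visited", "earthquake", "party", "crash", "crashed"]

def find_event_tokens(text, cutoff=0.8):
    """Return (token, matched_trigger) pairs using fuzzy string matching."""
    hits = []
    for raw in text.lower().split():
        token = raw.strip("!?.,:;\"'")  # drop surrounding punctuation
        if not token:
            continue
        match = difflib.get_close_matches(token, EVENT_TRIGGERS, n=1, cutoff=cutoff)
        if match:
            hits.append((token, match[0]))
    return hits

print(find_event_tokens("Obama visitin Russia 2moro!!!"))
```

A misspelt token such as "visitin" still maps to a known trigger, with no parser or full sentence structure required.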
LOD for event reassembly
What is needed to reassemble events from social media?
Identify mentions of the same event
Collect facets and integrate them
LOD gives unique identifiers for facet values
Many possible lexicalisations for the same event (run, control)
Identify co-referring mentions through:
Shared actors
Consistent facets (i.e. non-conflicting)
Lexical event similarity (e.g. WordNet)
This helps cluster mentions of the same event and agglomerate facets
Final product: Event description grounded in linked open data
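The reassembly criteria above (shared actors, non-conflicting facets, lexical similarity) can be sketched as a greedy clustering pass. The mention layout, the hand-written table of similar lemmas (standing in for a WordNet lookup) and the facet values are all illustrative assumptions rather than the authors' method.

```python
def facets_consistent(a, b):
    """Facets conflict when both sides give different values for the same key."""
    return all(b[k] == v for k, v in a.items() if k in b)

def same_event(a, b, similar_lemmas):
    """Apply the three criteria: shared actor, lexical similarity, consistency."""
    shared_actor = bool(a["actors"] & b["actors"])
    lexical = (a["lemma"] == b["lemma"]
               or frozenset((a["lemma"], b["lemma"])) in similar_lemmas)
    return shared_actor and lexical and facets_consistent(a["facets"], b["facets"])

def reassemble(mentions, similar_lemmas):
    """Greedy single pass: join the first compatible event, else start a new one."""
    events = []
    for m in mentions:
        for ev in events:
            if same_event(ev, m, similar_lemmas):
                ev["actors"] |= m["actors"]
                ev["facets"].update(m["facets"])  # agglomerate facets
                break
        else:
            events.append({"lemma": m["lemma"],
                           "actors": set(m["actors"]),
                           "facets": dict(m["facets"])})
    return events

# Toy mentions; facet values use DBpedia-style identifiers where possible.
mentions = [
    {"lemma": "visit", "actors": {"dbpedia:Barack_Obama"},
     "facets": {"location": "dbpedia:Russia"}},
    {"lemma": "travel", "actors": {"dbpedia:Barack_Obama"},
     "facets": {"time": "monday"}},
    {"lemma": "visit", "actors": {"dbpedia:Barack_Obama"},
     "facets": {"location": "dbpedia:France"}},
]
similar = {frozenset(("visit", "travel"))}
events = reassemble(mentions, similar)
print(len(events), events[0]["facets"])
```

The first two mentions merge into one event carrying both a location and a time, while the third is kept apart because its location conflicts. Unique identifiers for facet values are what make the conflict check a simple equality test.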
Conclusion
Event extraction from social media using linked open data enables extraction of rich event descriptions
Thank you!
Thank you for listening!
Do you have any questions?
