Recognising and Interpreting
Named Temporal Expressions
Matteo Brucato
Leon Derczynski
Hector Llorens
Kalina Bontcheva
Christian S. Jensen
How do we talk about times?
● Calendar
● Closed class of terms
– tomorrow | today | yesterday
– [next | last ] [ week | month | year]
– [1 - 31] [January – December]
● Really deterministic
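As a rough illustration of why the closed class is easy, the three patterns above fit in a single regular expression. This is a minimal sketch, not the grammar of any TempEval system; the names `STRUCTURED_TIMEX` and `find_structured_timexes` are made up for this example.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# One alternative per pattern on the slide.
STRUCTURED_TIMEX = re.compile(
    r"\b(?:"
    r"tomorrow|today|yesterday"                    # deictic days
    r"|(?:next|last)\s+(?:week|month|year)"        # relative units
    rf"|(?:[12]\d|3[01]|0?[1-9])\s+(?:{MONTHS})"   # day number + month name
    r")\b",
    re.IGNORECASE,
)


def find_structured_timexes(text):
    """Return every closed-class time expression found in text."""
    return [m.group(0) for m in STRUCTURED_TIMEX.finditer(text)]
```

A named expression such as "Christmas" matches none of the alternatives, which is the gap the rest of the talk addresses.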
Wow, it's super-deterministic!
Credit: Kevin Knight
… sometimes
● TempEval-2 timex recall: 66 – 88 %
● TempEval-2 normalisation: 55 – 85 %
● ~150 rules needed to get to 81% (Angeli & Uszkoreit '13)
● We can get the structured expressions OK
● But what about the rest?
Unstructured time mentions
– Christmas
– Michaelmas
– Halloween
– Easter
● Can we learn how to recognise these?
Time expression diversity
● Current corpora too small to hold much linguistic variation
● Note characteristic knee in distribution (cf. Montemurro)
Named Temporal Expressions
● New class of timexes
– Doesn't look like a timex
– Doesn't sound like a timex
– … is, in fact, a timex
How can we mine and extract NTEs?
● Expensive to annotate and hope they appear
● Prefer an automated approach
– → Let's mine Wikipedia!
● 432 English NTEs found
NTEs in Wikipedia
● Gives term and text description
● Problem: no good as a gazetteer, because some entries are polysemous (e.g. Carnival)
● Problem: recall limited with gazetteers
● Solution: build statistical tagger
Building statistical NTE tagger
● Use list of NTEs to annotate sentences
– CoNLL format, I/O binary labels
● Only use monosemous expressions
● Visit linked data searching for expressions
● If many entities found, expression is polysemous
– SELECT DISTINCT ?r {?r rdfs:label "carnival"@en}
– Not monosemous
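The polysemy check can be sketched as two small helpers: one builds the label query shown above, the other applies the "many entities" test to whatever the endpoint returns. The function names and the threshold of one entity are assumptions made for illustration; the slides do not say how many entities count as "many", nor which endpoint was queried.

```python
def label_query(expression, lang="en"):
    """Build the SPARQL label lookup from the slide, with the rdfs prefix declared."""
    escaped = expression.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        f'SELECT DISTINCT ?r {{ ?r rdfs:label "{escaped}"@{lang} }}'
    )


def is_monosemous(resources, max_entities=1):
    """Treat an expression as monosemous if its label maps to few distinct resources.

    resources is the list of ?r bindings returned by the endpoint.
    max_entities=1 is an assumed threshold, not a value from the talk.
    """
    return 0 < len(set(resources)) <= max_entities
```

Sending the query is left out on purpose; any SPARQL client can run the string returned by `label_query` and pass the resulting bindings to `is_monosemous`.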
Building statistical NTE tagger
● If a sentence contains a monosemous NTE, also annotate any polysemous NTEs
● Assume that they will occur in temporal sense
“While it might not have the retail significance of Christmas, Halloween or Secretary's Day, Groundhog Day remains perhaps the weirdest American holiday.”
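The annotation heuristic might look like the sketch below: a sentence is kept only if it contains a monosemous NTE, and in that case polysemous NTEs in the same sentence are labelled too. Tokenisation is assumed to have been done already, and the helper names (`find_spans`, `annotate`, `to_conll`) are invented for this example.

```python
def find_spans(tokens, phrases):
    """Return (start, end) token spans where any phrase occurs, ignoring case."""
    lowered = [t.lower() for t in tokens]
    spans = []
    for phrase in phrases:
        words = phrase.lower().split()
        n = len(words)
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == words:
                spans.append((i, i + n))
    return spans


def annotate(tokens, monosemous, polysemous):
    """Label tokens I (inside an NTE) or O (outside).

    Returns None when the sentence has no monosemous NTE, because only
    such sentences are used as training data.
    """
    mono_spans = find_spans(tokens, monosemous)
    if not mono_spans:
        return None
    labels = ["O"] * len(tokens)
    # Polysemous NTEs are assumed to be temporal in this context.
    for start, end in mono_spans + find_spans(tokens, polysemous):
        for i in range(start, end):
            labels[i] = "I"
    return list(zip(tokens, labels))


def to_conll(rows):
    """Serialise (token, label) pairs as one tab-separated pair per line."""
    return "\n".join(f"{tok}\t{lab}" for tok, lab in rows)
```

For instance, with "Groundhog Day" treated as monosemous and "Carnival" as polysemous (an assumption for the example), the tokens of "Carnival ends before Groundhog Day" receive the labels I O O I I, while "The carnival came to town" is skipped.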
NTE recognition results
● Baseline: gazetteer of timexes in existing resources
● 2:1 train:eval split, strict matching evaluation
● Also found new NTEs!
– European Cup
– Dayton Peace Agreement
How do we normalise NTEs?
● Target representation: TIMEX3
– January 2nd, 1980 → 1980-01-02
– Summer 2012 → 2012-SU
– now → PRESENT_REF
● Statistical learning won't manage
● Use dedicated tool, TIMEN
– Open normalisation toolkit
– Anyone can contribute
– SotA normalisation performance
– Takes a document with entity boundaries marked
Using NTE descriptions
● We have semi-structured descriptions
– “six weeks after Easter”
– “last Friday in June”
– “end of week 17”
– “tenth day of Tishrei”
● How to convert these to rules?
NTE normalisation rule extraction
● Create simple parser to cover majority of NTEs
– “June 25th”
– “Last Sunday in March”
● Covers 70.3% of NTE descriptions
● Remainder of rules may be added manually
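A parser of this kind could be as small as the sketch below, which handles the two description shapes quoted above and returns nothing for everything else. It is an illustration under assumptions, not the parser used in the work; `resolve` and the pattern names are hypothetical, and rule output is shown as a plain date whose ISO form matches the TIMEX3 date value format.

```python
import calendar
import re
from datetime import date

MONTHS = {m: i for i, m in enumerate(
    "january february march april may june july august "
    "september october november december".split(), start=1)}
WEEKDAYS = {d: i for i, d in enumerate(
    "monday tuesday wednesday thursday friday saturday sunday".split())}
ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4, "last": -1}

FIXED = re.compile(r"^(\w+) (\d{1,2})(?:st|nd|rd|th)?$")   # "June 25th"
NTH = re.compile(r"^(\w+) (\w+) (?:in|of) (\w+)$")         # "Last Sunday in March"


def resolve(description, year):
    """Turn a description into a date in the given year, or None if unparsed."""
    text = " ".join(description.lower().split())
    m = FIXED.match(text)
    if m and m.group(1) in MONTHS:
        return date(year, MONTHS[m.group(1)], int(m.group(2)))
    m = NTH.match(text)
    if m and m.group(1) in ORDINALS and m.group(2) in WEEKDAYS and m.group(3) in MONTHS:
        month = MONTHS[m.group(3)]
        # All dates in the month that fall on the requested weekday.
        days = [d for d in calendar.Calendar().itermonthdates(year, month)
                if d.month == month and d.weekday() == WEEKDAYS[m.group(2)]]
        n = ORDINALS[m.group(1)]
        return days[n - 1] if n > 0 else days[-1]
    return None
```

Descriptions that depend on another event, such as "six weeks after Easter", fall through to `None` here and would need a hand-written rule.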
Normalisation + NTEs
● Evaluation
● Two corpora:
– SotA (TempEval-3)
– Purpose built to be hard to normalise (TimenEval)
● On TempEval-3 (restricted newswire): 0.7% error reduction
● On TimenEval (varied genre): 4.3% error reduction
Outstanding issues: Spatial variation
● Labo[u]r Day
– May 1 in much of the world
– first Monday in May in Australia's QLD and NT
● Summer
– Official vs. informal
– North vs. south
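One way to picture the problem is a normalisation rule that takes the region as an extra argument. The sketch below encodes only the two Labo[u]r Day variants named above; the ISO 3166-2 style region codes and the May 1 default for every other region are assumptions for the example, and many regions follow neither rule.

```python
from datetime import date, timedelta


def first_weekday_on_or_after(start, weekday):
    """Return the first date on or after start that falls on weekday (Monday = 0)."""
    return start + timedelta(days=(weekday - start.weekday()) % 7)


def labour_day(year, region=None):
    """Resolve Labo[u]r Day for a year, using only the two rules from the slide."""
    if region in {"AU-QLD", "AU-NT"}:
        # First Monday in May.
        return first_weekday_on_or_after(date(year, 5, 1), 0)
    # Assumed default: May 1.
    return date(year, 5, 1)
```

The same mention therefore normalises to different values depending on where the text was written, which the document alone may not reveal.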
Outstanding issues: Easter
● Commonly used as an offset
● Non-trivial to determine
● “Computus”
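To show what "non-trivial" means in practice, here is the widely published anonymous Gregorian computus (also known as the Meeus/Jones/Butcher algorithm) for Western Easter Sunday. It is a standard algorithm swapped in for illustration, not necessarily what TIMEN uses, and it does not cover Orthodox Easter.

```python
from datetime import date, timedelta


def easter(year):
    """Western Easter Sunday in the Gregorian calendar (anonymous Gregorian algorithm)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def weeks_after_easter(year, weeks):
    """Resolve an offset description such as 'six weeks after Easter'."""
    return easter(year) + timedelta(weeks=weeks)
```

With Easter available as an anchor, offset descriptions reduce to date arithmetic: `weeks_after_easter(2013, 6)` is 42 days after 31 March 2013.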
Outstanding issues: Multiple calendars
● Gregorian (Quite popular)
– Not particularly rational in the first place
● Lunar (China)
● Astrological
● Hebrew
● .. and so on
Outstanding issues: Forms of expression
● Orthographic variation:
– Martin Luther King Day
– MLK Day
● Regional variation:
– autumn
– fall
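A simple way to cope with such variation is to map each surface form to one canonical form before normalisation. The table below is a hypothetical sketch that holds only the variants listed above; a real resource would need far more entries, and "fall" in particular is ambiguous outside temporal contexts.

```python
# Surface form (lowercased) -> canonical form. Entries are limited to the
# variants on the slide; the choice of canonical form is an assumption.
CANONICAL = {
    "martin luther king day": "Martin Luther King Day",
    "mlk day": "Martin Luther King Day",
    "autumn": "autumn",
    "fall": "autumn",
}


def canonical_form(mention):
    """Return the canonical form of a mention, or the mention itself if unknown."""
    key = " ".join(mention.lower().split())
    return CANONICAL.get(key, mention)
```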
Resources provided
● Corpus of NTEs
● Rules integrated into TIMEN in next release
– around November 2013
Thank you for your time!
Do you have any questions?
