This document provides an overview of information extraction (IE). It describes IE as the process of scanning text to extract relevant entities, relations, and events. The document outlines common IE tasks like named entity recognition and discusses approaches to IE like using cascaded finite-state transducers and learning-based methods. It also addresses challenges in IE like measuring performance and how systems are progressing towards overcoming the 60% accuracy barrier.
4. What is IE?
Information Extraction is the
process of scanning text for
information relevant to
some interest
Extract:
Entities
Relations
Events
Who did what to whom
when and where
5. Why IE?
Need for efficient processing of texts in
specialized domains
Focus on relevant parts, ignore the rest
Typical applications:
Gleaning business, government, and
military intelligence
WWW searches (more specific than keywords)
Scientific literature searches
…
6. Most common uses
Named Entity Recognition
Identify names, special
entities (dates, times)
Uses textual patterns
Important in biomedical
applications
IE is more than NER
Recognition of events
and their participants
7. How to measure
performance
Recall
What percentage of the correct answers did the
system get
Precision
What percentage of the system’s answers were
correct
F-score
Weighted harmonic mean between recall and
precision
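The three measures above can be sketched in a few lines of Python. This is a minimal illustration, assuming extractions are compared as exact-match sets; the function name `precision_recall_f`, the `beta` weight, and the toy data are assumptions made for the example, not part of the original slides.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Score predicted extractions against a gold-standard set (exact match)."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    # Precision: share of the system's answers that were correct
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: share of the correct answers that the system found
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    # F-score: weighted harmonic mean; beta > 1 favours recall
    b2 = beta ** 2
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Hypothetical toy data: 4 gold answers, 3 system answers, 2 of them correct
gold = {("PERSON", "John Smith"), ("ORG", "Acme Corp"),
        ("DATE", "May 3"), ("LOC", "Boston")}
predicted = {("PERSON", "John Smith"), ("ORG", "Acme Corp"), ("LOC", "Paris")}

p, r, f = precision_recall_f(gold, predicted)
print(round(p, 3), r, round(f, 3))  # 0.667 0.5 0.571
```

With `beta=1.0` recall and precision are weighted equally, which gives the usual F1 score.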
9. Unstructured vs. Semi-structured
text
Unstructured
Natural language
sentences
Semantics depends on
linguistic analysis
Examples:
News stories
Magazine articles
Books
…
Semi-structured
Structured data
Semantics defined by its
organization
Physical layout plays role in
interpretation
Examples:
Job postings
Rental ads
…
10. Single-document vs. Multi-
document
Originally IE systems designed for individual
documents
Nowadays, new systems extract facts from the WWW
Both use similar techniques
Distinguishing issue: redundancy
Multi-document can exploit redundancy
However, they must tackle the challenge of cross-document
coreference resolution
Multi-document IE systems are also referred to as
open-domain
14. Complex Words
Identify multiwords, company names, people
names, locations, dates, times and basic entities
Recognition strategies:
Patterns
Dictionaries
Context
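The three recognition strategies can be illustrated with a small sketch. Everything here is an assumption made for the example: the regular expressions, the tiny `LOCATIONS` dictionary, the title-based context rule for person names, and the `recognize` function are hypothetical and far simpler than what a real system would use.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Strategy 1, patterns: surface regularities of dates, times, company names
DATE_RE = re.compile(r"\b(?:" + MONTHS + r") \d{1,2}, \d{4}\b")
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}\s?(?:am|pm)\b", re.IGNORECASE)
COMPANY_RE = re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Inc|Corp|Ltd)\.?")

# Strategy 2, dictionaries: a (hypothetical) gazetteer of known locations
LOCATIONS = {"Boston", "New York", "Tokyo"}

# Strategy 3, context: a title such as "Dr." signals a following person name
PERSON_RE = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\.\s+([A-Z][a-z]+(?: [A-Z][a-z]+)*)")

PATTERNS = [("DATE", DATE_RE), ("TIME", TIME_RE),
            ("COMPANY", COMPANY_RE), ("PERSON", PERSON_RE)]


def recognize(text):
    """Return (label, string) pairs in order of appearance in the text."""
    found = []
    for label, regex in PATTERNS:
        group = 1 if regex.groups else 0  # use the capture group if present
        for m in regex.finditer(text):
            found.append((m.start(group), label, m.group(group)))
    for loc in LOCATIONS:
        for m in re.finditer(r"\b" + re.escape(loc) + r"\b", text):
            found.append((m.start(), "LOCATION", loc))
    return [(label, s) for _, label, s in sorted(found)]


text = "Dr. Jane Doe joined Acme Widgets Inc. in Boston on March 3, 2004 at 9:30 am."
entities = recognize(text)
for label, s in entities:
    print(label, "->", s)
```

On the sample sentence this finds the person, company, location, date and time, in that order.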
15. Basic Phrases
Some syntactic constructs can be
identified with reasonable
reliability:
Noun group
Verb group
Strategies:
Simple finite-state grammars
Ambiguities
Noun-verb ambiguity
Verbs locally ambiguous
Problems
Not all languages have a clear
distinction between noun and
verb groups
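A simple finite-state grammar for noun and verb groups might look like the sketch below. It assumes the input is already part-of-speech tagged with a small hypothetical tag set (`DET`, `ADJ`, `NOUN`, `AUX`, `ADV`, `VERB`, `PREP`), and the two rules (NG = DET? ADJ* NOUN+, VG = (AUX|ADV)* VERB) are illustrative choices, not the grammar of any particular system.

```python
def chunk(tagged):
    """Greedy left-to-right chunker over (word, tag) pairs.

    NG = DET? ADJ* NOUN+      VG = (AUX | ADV)* VERB
    Tokens that fit neither rule are skipped.
    """
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        # Try a noun group starting at position i
        j = i
        if tagged[j][1] == "DET":
            j += 1
        while j < n and tagged[j][1] == "ADJ":
            j += 1
        k = j
        while k < n and tagged[k][1] == "NOUN":
            k += 1
        if k > j:  # at least one noun was consumed
            chunks.append(("NG", " ".join(w for w, _ in tagged[i:k])))
            i = k
            continue
        # Otherwise try a verb group starting at position i
        j = i
        while j < n and tagged[j][1] in ("AUX", "ADV"):
            j += 1
        if j < n and tagged[j][1] == "VERB":
            chunks.append(("VG", " ".join(w for w, _ in tagged[i:j + 1])))
            i = j + 1
            continue
        i += 1  # token belongs to no group
    return chunks


tagged = [("The", "DET"), ("new", "ADJ"), ("company", "NOUN"),
          ("has", "AUX"), ("recently", "ADV"), ("acquired", "VERB"),
          ("a", "DET"), ("small", "ADJ"), ("software", "NOUN"), ("firm", "NOUN"),
          ("in", "PREP"), ("Boston", "NOUN")]
chunks = chunk(tagged)
print(chunks)
```

The sketch sidesteps the noun-verb ambiguity mentioned above by trusting the tagger; a word tagged wrongly simply ends up in the wrong group.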
16. Complex Phrases
Recognize complex noun and verb groups
Complex noun groups
Appositives
Measure phrases
Prepositional attachments (of, for)
Noun group conjunction
Complex verb groups
Verb conjunction
Verb groups with same significance
Domain-relevant entities can be recognized
17. Domain Events
Ignore anything not identified in previous phases
Domain events require domain-specific patterns
for identification
Strategy:
Finite-state machines
A certain kind of “pseudo-syntax” can be done
Nowadays IE systems begin to rely on full-
sentence parsing
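A domain-specific pattern of this kind can be sketched as a match over the sequence of groups produced by earlier phases. The example assumes a hypothetical acquisition domain, chunks given as `(type, text)` pairs, and a hand-written verb list; `ACQUISITION_VERBS` and `find_acquisitions` are names invented for the illustration.

```python
# Hypothetical trigger verbs for an "acquisition" event
ACQUISITION_VERBS = {"acquired", "bought", "purchased"}


def find_acquisitions(chunks):
    """Match the pattern  NG  VG(head is an acquisition verb)  NG."""
    events = []
    for i in range(len(chunks) - 2):
        (t1, subj), (t2, verb), (t3, obj) = chunks[i:i + 3]
        # The head of the verb group is taken to be its last word
        if (t1, t2, t3) == ("NG", "VG", "NG") and verb.split()[-1] in ACQUISITION_VERBS:
            events.append({"event": "ACQUISITION", "buyer": subj, "target": obj})
    return events


# Output of earlier phases; anything they did not identify is already gone
chunks = [("NG", "Acme Corp"), ("VG", "has acquired"),
          ("NG", "Widget Ltd"), ("NG", "Boston")]
events = find_acquisitions(chunks)
print(events)
```

Each new domain needs its own trigger words and slot names, which is the knowledge engineering cost that the learning-based approaches try to reduce.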
18. Template Generation:
Merging Structures
Previous stages operate within bounds of single
sentences
Operate over the whole text to combine previously
collected information into a unified whole
If recognizing multiple events:
Determine how many distinct events
Assign each entity to appropriate event
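The merging step can be sketched as follows. This is a deliberately naive illustration: it assumes partial templates are dictionaries, that `None` marks an unfilled slot, and that two partial templates describe the same event whenever none of their filled slots conflict. The helper names `compatible` and `merge_templates` are hypothetical.

```python
def compatible(a, b):
    """Two partial templates are compatible if no filled slot conflicts."""
    return all(a[k] == b[k]
               for k in a.keys() & b.keys()
               if a[k] is not None and b[k] is not None)


def merge_templates(partials):
    """Fold sentence-level partial templates into distinct events."""
    merged = []
    for p in partials:
        for m in merged:
            if compatible(m, p):
                # Same event: copy over every slot that p has filled
                for slot, value in p.items():
                    if value is not None:
                        m[slot] = value
                break
        else:
            # Conflicts with every known event, so it is a new one
            merged.append(dict(p))
    return merged


partials = [
    {"event": "ACQUISITION", "buyer": "Acme Corp", "target": None},
    {"event": "ACQUISITION", "buyer": None, "target": "Widget Ltd", "date": "March 3"},
    {"event": "ACQUISITION", "buyer": "Globex", "target": "Initech"},
]
templates = merge_templates(partials)
print(len(templates))  # 2
```

The first two partial templates merge into one event; the third has a different buyer and therefore counts as a second event. Real systems also need coreference resolution to decide when two different strings name the same entity.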
20. Supervised Learning of
Extraction patterns & rules
Reduce knowledge engineering bottleneck
required to create an IE system for a new domain
Examples:
AutoSlog: creates lexico-syntactic patterns
PALKA: generalizes patterns based on word
semantics
LIEP: identifies syntactic paths related to roles
CRYSTAL: “concept nodes” with lexical, syntactic
and semantic constraints
WHISK: learns regular expressions
Many others: SRV, RAPIER, …
21. Supervised Learning of
sequential classifier models
View IE as a classification problem that can be
tackled using sequential learning models
Read sequentially and label each word as an
extraction or a non-extraction
Typical labeling scheme IOB
Inside
Outside
Beginning of desired extraction
Strategies:
Hidden Markov Models
Maximum Entropy Classifiers
Support Vector Machines
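The IOB scheme itself is easy to show in code, independently of which classifier predicts the labels. The sketch below assumes extractions are given as token spans `(start, end, label)` with an exclusive end; the function names `to_iob` and `from_iob` are hypothetical.

```python
def to_iob(tokens, spans):
    """Turn (start, end, label) token spans into one IOB tag per token."""
    tags = ["O"] * len(tokens)            # Outside by default
    for start, end, label in spans:
        tags[start] = "B-" + label        # Beginning of an extraction
        for i in range(start + 1, end):
            tags[i] = "I-" + label        # Inside the same extraction
    return tags


def from_iob(tags):
    """Recover (start, end, label) spans from a sequence of IOB tags."""
    spans, current = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [i, i + 1, tag[2:]]
        elif tag.startswith("I-") and current and current[2] == tag[2:]:
            current[1] = i + 1            # extend the open span
        else:
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans


tokens = "John Smith joined Acme Corp yesterday".split()
spans = [(0, 2, "PER"), (3, 5, "ORG")]
tags = to_iob(tokens, spans)
print(list(zip(tokens, tags)))
```

A sequential model such as an HMM is then trained to predict the tag sequence from the token sequence, and `from_iob` turns its predictions back into extractions.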
22. Weakly supervised and
unsupervised approaches
Annotating training text is still time-consuming and
complex
Further techniques learn extraction using weakly
supervised and unsupervised systems
Examples
AutoSlog-TS (preclassified corpus in which texts are identified
as relevant or irrelevant)
Ex-Disco (manually defined seed; patterns ranked; best
patterns selected and added to seed)
Meta-bootstrapping (seed nouns that belong to a
semantic class)
On-Demand Information Extraction (dynamically learns
from queries)
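The general bootstrapping idea behind several of these systems (start from a few seeds, find patterns that extract them, use the best pattern to harvest new members) can be sketched as below. This is a loose illustration and not the published algorithm of any system listed above: the "pattern" is simply the word that precedes a candidate, the score is the number of known members a pattern extracts, and the corpus is a hypothetical toy.

```python
from collections import defaultdict


def bootstrap(sentences, seeds, iterations=2):
    """Grow a semantic lexicon from seed words using preceding-word patterns."""
    lexicon, used = set(seeds), set()
    # Collect what every pattern (here: the preceding word) extracts
    extractions = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for prev, word in zip(words, words[1:]):
            extractions[prev].add(word)
    for _ in range(iterations):
        # Score each unused pattern by how many known members it extracts
        scores = {p: len(ws & lexicon)
                  for p, ws in extractions.items() if p not in used}
        if not scores:
            break
        best = max(sorted(scores), key=lambda p: scores[p])
        if scores[best] == 0:
            break
        used.add(best)
        lexicon |= extractions[best]      # add everything the best pattern finds
    return lexicon


sentences = ["offices in Boston", "offices in Tokyo", "plants in Lima",
             "visited Lima yesterday", "visited Oslo yesterday"]
locations = bootstrap(sentences, seeds={"Boston"})
print(sorted(locations))  # ['Boston', 'Lima', 'Oslo', 'Tokyo']
```

Adding every extraction of the best pattern without filtering is what makes such loops drift in practice; the systems above add ranking and selection steps to control that.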
23. Discourse-oriented
approaches to IE
Most IE system patterns focus only on the local
surrounding context
Extend systems to have more global view
Strategy:
Add constraints to connect entities in different
clauses
Decision trees (WRAP-UP)
Set of classifiers to identify new templates (ALICE)
25. How are IE systems
progressing?
The 60% barrier in performance
Biggest mistakes in entity and event coreference
Implicit knowledge in natural language is not stated in the texts
Problems on training data not found on test data
Good IE systems typically recognize 90% of entities
An event requires about 4 entities
0.9*0.9*0.9*0.9 = 65.61%
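The arithmetic above can be checked with a short sketch. It assumes that each entity is recognized independently of the others, which is the simplification the slide's multiplication implies; the function name `event_accuracy` is hypothetical.

```python
def event_accuracy(entity_accuracy, entities_per_event):
    """Chance that every entity of an event is recognized correctly,
    assuming the entities are recognized independently."""
    return entity_accuracy ** entities_per_event


print(f"{event_accuracy(0.9, 4):.2%}")  # 65.61%
```

Under the same assumption, raising entity recognition to 95% would only lift four-entity events to about 81%, which shows how quickly per-entity errors compound.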