2. Text Mining Course
• 1) Introduction to Text Mining
• 2) Introduction to NLP
• 3) Named Entity Recognition and Disambiguation
• 4) Opinion Mining and Sentiment Analysis
• 5) Information Extraction
• 6) NewsReader and Visualisation
• 7) Guest Lecture and Q&A
3. Outline
1. What is Information Extraction
2. Main goals of Information Extraction
3. Information Extraction Tasks and Subtasks
4. MUC conferences
5. Main domains of Information Extraction
6. Methods for Information Extraction
o Cascaded finite-state transducers
o Regular expressions and patterns
o Supervised learning approaches
o Weakly supervised and unsupervised approaches
7. How far we are with IE
4. What is IE?
• Late 1970s, within the NLP field
• Automatically find and extract limited, relevant parts of texts
• Merge information from many pieces of text
5. What is IE?
• Quite often in specialized domains
• Move from unstructured/semi-structured data to
structured data
o Schemas
o Relations (as a database)
o Knowledge base
o RDF triples
6. What is IE?
Unstructured text
• Natural language sentences
• Historically, NLP systems have been designed to process this type of data
• Getting the meaning → linguistic analysis and natural language understanding
7. What is IE?
Semi-structured text
• The physical layout helps the interpretation
• Processing sits halfway: linguistic features ↔ positional features
9. Main goals of IE
• Fill a predefined “template” from raw text
• Extract who did what to whom and when?
o Event extraction
• Organize information so that it is useful to people
• Put information in a form that allows further
inferences by computers
o Big data
10. IE: Tasks & Subtasks
• Named Entity Recognition
o Detection → “Mr. Smith eats bitterballen” → [Mr. Smith] : ENTITY
o Classification → “Mr. Smith eats bitterballen” → [Mr. Smith] : PERSON
• Event extraction
o The thief broke the door with a hammer
• CAUSE_HARM → Verb: break
Agent: the thief
Patient: the door
Instrument: a hammer
• Coreference resolution
o [Mr. Smith] eats bitterballen. Besides this, [he] only drinks Belgian beer.
11. IE: Tasks & Subtasks
• Relationship extraction
o “Bill works for IBM” → PERSON works for ORGANISATION
• Terminology extraction
o Finding relevant multi-word terms in a given corpus
• Some concrete examples
o Extracting earnings, profits, board members, headquarters from company
reports
o Searching the WWW for e-mail addresses for advertising (spamming)
o Learning drug-gene product interactions from biomedical research papers
13. MUC conferences
• Message Understanding Conference (MUC), held
between 1987 and 1998.
• Domain specific texts + training examples + template
definition
• Precision, Recall and F1 as evaluation
• Domains
o MUC-1 (1987), MUC-2 (1989): Naval operations messages.
o MUC-3 (1991), MUC-4 (1992): Terrorism in Latin American countries.
o MUC-5 (1993): Joint ventures and microelectronics domain.
o MUC-6 (1995): News articles on management changes.
o MUC-7 (1998): Satellite launch reports.
14. MUC conferences
Bridgestone Sports Co. said Friday it has set up a joint venture in
Taiwan with a local concern and a Japanese trading house to produce
golf clubs to be shipped to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20
million new Taiwan dollars, will start production in January 1990 with
production of 20,000 iron and “metal wood” clubs a month.
Example from MUC-5
15. Main domains of IE
• Terrorist events
• Joint ventures
• Plane crashes
• Disease outbreaks
• Seminar announcements
• Biological and medical domain
16. Outline
1. What is Information Extraction
2. Main goals of Information Extraction
3. Information Extraction Tasks and Subtasks
4. MUC conferences
5. Main domains of Information Extraction
6. Methods for Information Extraction
o Cascaded finite-state transducers
o Regular expressions and patterns
o Supervised learning approaches
o Weakly supervised and unsupervised approaches
7. How far we are with IE
17. Methods for IE
• Cascaded finite-state transducers
o Rule based
o Regular expressions
• Learning based approaches
o Traditional classifiers
• Naïve Bayes, MaxEnt, SVM …
o Sequence label models
• HMM, CMM, CRF
• Unsupervised approaches
• Hybrid approaches
18. Cascaded finite-state transducers
• An idea that emerged from MUC participants and their approaches
• Decompose the task into small sub-tasks
• One element is read at a time from a sequence
o Depending on its type, a certain transition is produced in the automaton, moving to a new state
o Some states are considered final (the input matches a certain pattern)
• Can be defined as a regular expression
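As an illustration of this idea, here is a minimal sketch (not from the slides) of a token-level recognizer with explicit states; it accepts token sequences matching the regular expression ([A-Z]\w+ )+Co\., in the spirit of recognizing company names:

```python
# Minimal sketch of a finite-state recognizer: it reads one token at a time,
# each token type triggers a transition, and reaching the final state means
# the input matched the pattern, i.e. the regex ([A-Z]\w+ )+Co\.

def matches_company(tokens):
    state = "START"
    for tok in tokens:
        if state in ("START", "CAPS") and tok != "Co." and tok[:1].isupper():
            state = "CAPS"            # reading capitalized words
        elif state == "CAPS" and tok == "Co.":
            state = "FINAL"           # final state: the pattern matched
        else:
            return False
    return state == "FINAL"

print(matches_company("Bridgestone Sports Co.".split()))  # True
print(matches_company("a local concern".split()))         # False
```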
20. Cascaded finite-state transducers
• Earlier stages recognize smaller linguistic objects
o Usually domain independent
• Later stages build on top of the previous ones
o Usually domain dependent
• Stages of a typical IE system
1. Complex words
2. Basic phrases
3. Complex phrases
4. Domain events
5. Merging structures
21. Cascaded finite-state transducers
• Complex words
o Multiwords: “set up” “trading house”
o NE: “Bridgestone Sports Co”
• Basic Phrases
o Syntactic chunking
• Noun groups (head noun + all modifiers)
• Verb groups
23. Cascaded finite-state transducers
• Complex phrases
o Complex noun and verb groups on the basis of syntactic information
• The attachment of appositives to their head noun group
o “The joint venture, Bridgestone Sports Taiwan Co.,”
• The construction of measure phrases
o “20,000 iron and ‘metal wood’ clubs a month”
24. Cascaded finite-state transducers
• Domain events
o Recognize events and match them with “fillers” detected in previous steps
o Requires domain-specific patterns
• To recognize phrases of interest
• To define what the roles are
o Patterns can also be defined as finite-state machines or regular expressions
• <Company/ies><Set-up><Joint-Venture> with <Company/ies>
• <Company><Capitalized> at <Currency>
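A rough sketch of how such a cascade could look in practice, using the MUC-5 sentence from slide 14; the two regex patterns are simplified assumptions, meant only to show a later stage consuming an earlier stage's output:

```python
import re

# Stage 1 (domain independent): mark company names with a crude pattern.
# Stage 2 (domain dependent): match the <Company><Set-up><Joint-Venture> event.
text = ("Bridgestone Sports Co. said Friday it has set up a joint venture "
        "in Taiwan with a local concern and a Japanese trading house")

stage1 = re.sub(r"(?:[A-Z]\w+ )+Co\.", "<COMPANY>", text)

event = re.search(r"<COMPANY>.*?set up a joint venture.*?with (.+)", stage1)
if event:
    print("JOINT-VENTURE event, partner(s):", event.group(1))
```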
26. Regular Expressions
• 1950s, Stephen Kleene
• A string pattern that describes/matches a set of
strings
• A regular expression consists of:
o Characters
o Operation symbols
• Boolean (and/or)
• Grouping (for defining scopes)
• Quantification
27. Regular Expressions
Character : Description
• a : the character ‘a’
• . : any single character
• [abc] : any character in the brackets (OR): ‘a’ or ‘b’ or ‘c’
• [^abc] : any character not in the brackets: any symbol that is not ‘a’ or ‘b’ or ‘c’
• * : quantifier; matches the preceding element ZERO or more times
• + : quantifier; matches the preceding element ONE or more times
• ? : matches the preceding element zero or one time
• | : choice (OR); matches one of the expressions (before or after the |)
30. Regular Expressions
① .at → hat, cat, bat, xat, …
② [hc]at → hat, cat
③ [^b]at → all matched by .at except “bat”
④ [^hc]at → all matched by .at except “hat” and “cat”
⑤ s.* → s, sssss, ssbsd2ck3e
⑥ [hc]*at → hat, cat, hhat, chat, cchhat, at, …
⑦ cat|dog → cat, dog
⑧ ….
⑨ ….
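These patterns can be tried directly with Python's re module; a quick check of the examples above:

```python
import re

# The slide's patterns applied to a small word list:
words = ["hat", "cat", "bat", "xat", "at", "chat"]
print([w for w in words if re.fullmatch(r".at", w)])      # hat cat bat xat
print([w for w in words if re.fullmatch(r"[hc]at", w)])   # hat cat
print([w for w in words if re.fullmatch(r"[^b]at", w)])   # hat cat xat
print([w for w in words if re.fullmatch(r"[hc]*at", w)])  # hat cat at chat
print([w for w in words if re.fullmatch(r"cat|dog", w)])  # cat
```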
31. Using Regular
Expressions
• Typically, extracting information from automatically generated webpages is easy
o Wikipedia
• To know the country for a given city
o Amazon webpage
• From a list of hits
o Weather forecast webpages
o DBpedia
36. Using Regular
Expressions
• Some “unstructured” pieces of information retain some structure and are easy to capture by means of regular expressions
o Phone numbers
o E-mails
o URL Websites
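A sketch with deliberately simplified patterns (real phone and e-mail grammars are messier, and the contact strings below are made up):

```python
import re

phone = re.compile(r"\+?\d[\d\- ]{7,}\d")            # e.g. +31 20 598 9898
email = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # e.g. j.smith@vu.nl
url   = re.compile(r"https?://\S+")                  # e.g. https://vu.nl

text = "Mail j.smith@vu.nl or call +31 20 598 9898, see https://vu.nl"
print(phone.findall(text))
print(email.findall(text))
print(url.findall(text))
```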
37. Using Regular
Expressions
• Regular expressions can also be used to detect relations and fill events
• Higher level regular expressions make use of
“objects” detected by lower level patterns
• Some NLP information may help (pos tags, phrases,
semantic word categories)
o Crime-Victim can use things matched by “noun-group”
• Prefiller: [pos: V, type-of-verb: KILL] (from WordNet / MCR)
• Filler: [phrase: NOUN-GROUP]
38. Using Regular
Expressions
• Extracting relations between entities
o Which PERSON holds what POSITION in what ORGANIZATION
• [PER], [POSITION] of [ORG]
Entities:
o PER: Jose Mourinho
o POSITION: trainer
o ORG: Chelsea
Relation:
o (Jose Mourinho, trainer, Chelsea)
39. Using Regular
Expressions
• Extracting relations between entities
o Which PERSON holds what POSITION in what ORGANIZATION
• [PER], [POSITION ] of [ORG]
• [ORG] (named, appointed,…) [PER] Prep [POSITION]
o Nokia has appointed Rajeev Suri as President
o Where an ORGANIZATION is located
• [ORG] headquarters in [LOC]
o NATO headquarters in Brussels
• [ORG][LOC] (division, branch, headquarters…)
o KFOR Kosovo headquarters
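One way to implement such patterns, assuming a prior NER step has wrapped mentions in inline tags (the tag format here is an assumption):

```python
import re

# The higher-level pattern "[PER], [POSITION] of [ORG]" becomes one regex
# over the NER-tagged text.
tagged = "<PER>Jose Mourinho</PER>, <POSITION>trainer</POSITION> of <ORG>Chelsea</ORG>"

pattern = re.compile(r"<PER>(.+?)</PER>, <POSITION>(.+?)</POSITION> of <ORG>(.+?)</ORG>")
match = pattern.search(tagged)
if match:
    print(match.groups())   # ('Jose Mourinho', 'trainer', 'Chelsea')
```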
41. Extracting relations with patterns
• Hearst 1992
• What does Gelidium mean?
• “Agar is a substance prepared from a mixture of red
algae, such as Gelidium, for laboratory or industrial
use”
• How do you know?
42. Extracting relations with patterns
• Hearst 1992: Automatic Acquisition of Hyponyms (IS-A)
X → Gelidium (sub-type), Y → red algae (super-type)
• X IS-A Y
• “Y such as X”
• “Y, such as X”
• “X or other Y”
• “X and other Y”
• “Y including X”
• ….
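A toy version of the “Y, such as X” pattern applied to the Gelidium sentence; matching Y as exactly two words is a crude stand-in for real NP chunking:

```python
import re

sentence = ("Agar is a substance prepared from a mixture of red algae, "
            "such as Gelidium, for laboratory or industrial use")

# Crude assumption: the hypernym Y is the two words before "such as".
m = re.search(r"(\w+ \w+), such as ([A-Z]\w+)", sentence)
if m:
    print(f"{m.group(2)} IS-A {m.group(1)}")   # Gelidium IS-A red algae
```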
44. Hand-built patterns
• Positive
o Tend to be high-precision
o Can be adapted to specific domains
• Negative
o Human patterns are usually low-recall
o A lot of work to think of all possible patterns
o Need to create a lot of patterns for every relation
45. Learning-based Approaches
• Statistical techniques and machine learning
algorithms
o Automatically learn patterns and models for new domains
• Some types
o Supervised learning of patterns and rules
o Supervised Learning for relation extraction
o Supervised learning of Sequential Classifier Methods
o Weakly supervised and unsupervised
46. Supervised Learning of Patterns and Rules
• Aiming to reduce the knowledge engineering bottleneck of creating an IE system for a new domain
• AutoSlog and PALKA → the first IE pattern learning systems
o AutoSlog: syntactic templates, lexico-syntactic patterns and manual review
• Learning algorithms → generate rules from annotated text
o LIEP (Huffman 1996): syntactic paths, role fillers; patterns that work well in training are kept
o (LP)2 uses tagging rules and correction rules
47. Supervised Learning of Patterns and Rules
• Relational learning methods
o RAPIER: rules for pre-filler, filler, and post-filler component. Each
component is a pattern that consists of words, POS tags, and semantic
classes.
48. Supervised Learning for
relation extraction (I)
• Design a supervised machine learning framework
• Decide what relations we are interested in
• Choose what entities are relevant
• Find (or create) labeled data
o Representative corpus
o Label the entities in the corpus (Automatic NER)
o Hand-label relations between these entities
o Split into train + dev + test
• Train, improve and evaluate
49. Supervised Learning for
relation extraction (II)
• Relation extraction as a classification problem
• 2 classifiers
o To decide if two entities are related
o To decide the class for a pair of related entities
• Why 2?
o Faster training by eliminating most pairs
o Appropriate feature sets for each task
• Find all pairs of NEs (restricted to the sentence)
o For every pair
1. Are the entities related? (classifier 1)
1. No → END
2. Yes → guess the class (classifier 2)
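A minimal sketch of this two-classifier cascade with scikit-learn, on a made-up toy training set (the feature dicts and labels are purely illustrative):

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny toy training set: one feature dict per candidate entity pair.
feats = [{"between": "works for"}, {"between": "of"}, {"between": "said"},
         {"between": "works for"}, {"between": "of"}, {"between": "said"}]
is_related = np.array([1, 1, 0, 1, 1, 0])                    # classifier 1 labels
rel_type = np.array(["EMPLOYED", "PART-OF", "", "EMPLOYED", "PART-OF", ""])

vec = DictVectorizer()
X = vec.fit_transform(feats)
clf1 = LogisticRegression().fit(X, is_related)               # related yes/no
clf2 = LogisticRegression().fit(X[is_related == 1],          # relation class,
                                rel_type[is_related == 1])   # related pairs only

def classify(pair_feats):
    x = vec.transform([pair_feats])
    if clf1.predict(x)[0] == 0:
        return None                       # 1. no  -> END
    return clf2.predict(x)[0]             # 2. yes -> guess the class

print(classify({"between": "works for"}))  # EMPLOYED
```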
50. Supervised Learning for
relation extraction (III)
• Are the two entities related?
• What is the type of relation?
51. Supervised Learning for
relation extraction (IV)
“[American Airlines], a unit of AMR, immediately
matched the move, spokesman [Tim Wagner] said”
• What features?
o Head words of entity mentions and combination
• Airlines Wagner Airlines-Wagner
o Bag-of-words in the two entity mentions
• American, Airlines, Tim, Wagner, American Airlines, Tim Wagner
o Words/bigrams in particular positions to the left and right
• M2#-1: spokesman, M2#+1: said
o Bag-of-words (or bigrams) between the 2 mentions
• a, AMR, of, immediately, matched, move, spokesman, the, unit
52. Supervised Learning for
relation extraction (V)
“[American Airlines], a unit of AMR, immediately
matched the move, spokesman [Tim Wagner] said”
• What features?
o Named entity types
• M1: ORG M2: PERSON
o Entity level (Name, Nominal (NP), Pronoun)
• M1: NAME (“it” or “he” would be PRONOUN)
• M2: NAME (“the company” would be NOMINAL)
o Basic chunk sequence from one entity to the other
• NP NP PP VP NP NP
o Constituency path on the parse tree
• NP ↑ NP ↑ S ↑ S ↓ NP
53. Supervised Learning for
relation extraction (VI)
“[American Airlines], a unit of AMR, immediately
matched the move, spokesman [Tim Wagner] said”
• What features?
• Trigger lists
o For family → parent, wife, husband… (WordNet)
• Gazetteers
o List of countries…
• ….
• ….
• …
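Pulling a few of the listed features together, a sketch of feature extraction for the example sentence (token spans hard-coded for brevity; M1 = American Airlines, M2 = Tim Wagner):

```python
tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said").split()
m1, m2 = (0, 2), (14, 16)                  # [start, end) token spans

features = {
    "head_m1": tokens[m1[1] - 1],                        # Airlines
    "head_m2": tokens[m2[1] - 1],                        # Wagner
    "head_pair": tokens[m1[1] - 1] + "-" + tokens[m2[1] - 1],
    "m2_prev": tokens[m2[0] - 1],                        # spokesman
    "m2_next": tokens[m2[1]],                            # said
    "bow_between": sorted(set(tokens[m1[1]:m2[0]])),     # words between M1 and M2
    "ne_types": "ORG-PERSON",
}
print(features)
```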
54. Supervised Learning for
relation extraction (VII)
• Decide your algorithm
o MaxEnt, Naïve Bayes, SVM
• Train the system on the training data
• Tune it on the dev set
• Test on the evaluation set
o Traditional Precision, Recall and F-score
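For the last step, a small example of computing micro-averaged precision, recall and F-score with scikit-learn on hypothetical gold and predicted labels (the NONE class is excluded from scoring):

```python
from sklearn.metrics import precision_recall_fscore_support

gold = ["EMPLOYED", "PART-OF", "NONE", "EMPLOYED", "NONE"]
pred = ["EMPLOYED", "NONE", "NONE", "EMPLOYED", "PART-OF"]

p, r, f, _ = precision_recall_fscore_support(
    gold, pred, labels=["EMPLOYED", "PART-OF"], average="micro")
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")   # P=0.67 R=0.67 F1=0.67
```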
55. Sequential Classifier
Methods
• IE as a classification problem using sequential
learning models.
• A classifier is induced from annotated data to sequentially scan a text from left to right and decide which pieces of text must be extracted
• Decide what you want to extract
• Represent the annotated data in a proper way
57. Sequential Classifier
Methods
• Typical steps for training
o Get the annotated training data
o Represent the data in IOB
o Design feature extractors
o Decide the algorithm to use
o Train the models
• Testing steps
o Get the test documents
o Extract features
o Run the sequence models
o Extract the recognized entities
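A small example of what the IOB representation looks like:

```python
# IOB encoding: one label per token, where B- marks the first token of an
# entity, I- a continuation, and O anything outside an entity.
tokens = ["Mr.", "Smith", "works", "in", "New", "York"]
labels = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```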
58. Sequential Classifier
Methods
• Algorithms
o HMM
o CMM
o CRF
• Features
o Words (current, previous, next)
o Other linguistic information (PoS, chunks…)
o Task specific features (NER…)
• Word shapes: abstract representation for words
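A sketch combining a word-shape function with a CRF, assuming the sklearn-crfsuite package as an extra dependency; the one-sentence training set reuses the IOB example above and is obviously far too small for a real model:

```python
import re
import sklearn_crfsuite   # assumed dependency: pip install sklearn-crfsuite

def shape(word):
    # Word shape: abstract representation, e.g. "Mr." -> "Xx.", "1990" -> "dddd"
    word = re.sub(r"[A-Z]", "X", word)
    word = re.sub(r"[a-z]", "x", word)
    return re.sub(r"\d", "d", word)

def token_features(sent, i):
    return {"word": sent[i], "shape": shape(sent[i]),
            "prev": sent[i - 1] if i > 0 else "<S>",
            "next": sent[i + 1] if i < len(sent) - 1 else "</S>"}

sents = [["Mr.", "Smith", "works", "in", "New", "York"]]
labels = [["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]]
X = [[token_features(s, i) for i in range(len(s))] for s in sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```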
60. Weakly supervised and
unsupervised
• Manual annotation is also “expensive”
o IE is quite domain specific → little reuse across domains
• AutoSlog-TS:
o Just needs 2 sets of documents: relevant / irrelevant
o Syntactic templates + relevance scored against the relevant set
• ExDisco (Yangarber et al. 2000)
o No need for a preclassified corpus
o Uses a small set of seed patterns to decide relevant / irrelevant
61. Weakly supervised and
unsupervised
• OpeNER: a European project dealing with entity recognition, sentiment analysis and opinion mining, mainly in hotel reviews (also restaurants, attractions, news)
• Double propagation
o Method to automatically gather opinion words and targets
• From a large raw hotel corpus
• Providing a set of seeds and patterns
62. Weakly supervised and
unsupervised
• Seed list
o + → good, nice
o - → bad, ugly
• Patterns
o a [EXP] [TAR]
o the [EXP] [TAR]
• Polarity patterns
o = (same polarity): [EXP] and [EXP] ; [EXP], [EXP]
o ! (opposite polarity): [EXP] but [EXP]
63. Weakly supervised and
unsupervised
• Propagation method
o 1) Get new targets using the seed expressions and the patterns
• a nice [TAR], a bad [TAR], the ugly [TAR]
• Output → new targets (hotel, room, location)
o 2) Get new expressions using the previous targets and the patterns
• a [EXP] hotel, the [EXP] location
• Output → new expressions (expensive, cozy, perfect…)
o Keep running 1 and 2 to get new EXP and TAR (see the sketch below)
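A minimal sketch of this loop over a toy corpus, with the seeds and the (a|an|the) [EXP] [TAR] pattern from the previous slide:

```python
corpus = ["a nice hotel", "a bad room", "the ugly location",
          "a cozy hotel", "an expensive room", "the perfect location"]
expressions = {"good", "nice", "bad", "ugly"}    # seed list
targets = set()

for _ in range(3):                               # keep running steps 1 and 2
    for phrase in corpus:
        det, exp, tar = phrase.split()           # pattern: (a|an|the) [EXP] [TAR]
        if det not in ("a", "an", "the"):
            continue
        if exp in expressions:
            targets.add(tar)                     # step 1: known EXP -> new TAR
        elif tar in targets:
            expressions.add(exp)                 # step 2: known TAR -> new EXP

print(targets)       # {'hotel', 'room', 'location'}
print(expressions)   # seeds plus 'cozy', 'expensive', 'perfect'
```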
64. Weakly supervised and
unsupervised
• Polarity guessing
o Apply the polarity patterns to guess the polarity
• = a nice(+) and cozy(?) → cozy(+)
• ! clean(+) but expensive(?) → expensive(-)
https://github.com/opener-project/opinion-domain-lexicon-acquisition
65. Outline
1. What is Information Extraction
2. Main goals of Information Extraction
3. Information Extraction Tasks and Subtasks
4. MUC conferences
5. Main domains of Information Extraction
6. Methods for Information Extraction
o Cascaded finite-state transducers
o Regular expressions and patterns
o Supervised learning approaches
o Weakly supervised and unsupervised approaches
7. How far we are with IE
67. How good is IE
• Some progress has been made
• Still, the 60% barrier seems difficult to surpass
• Most errors are in entity and event coreference
• Propagation of errors
o Entity recognition → 90%
o One event → 4 entities
o 0.9^4 ≈ 0.66 → ~60%
• A lot of knowledge is implicit or “common world
knowledge”
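The arithmetic behind the ~60% figure, spelled out:

```python
# Compounding effect: four entity slots per event, each found with ~0.9 accuracy,
# so the whole event is correct only about 0.9**4 of the time.
print(0.9 ** 4)   # 0.6561
```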
68. How good is IE
Information Type : Accuracy
• Entities : 90 – 98%
• Attributes : 80%
• Relations : 60 – 70%
• Events : 50 – 60%
• Very optimistic numbers for well-established tasks
• The numbers go down for specific/new tasks