Enterprise Information Extraction

SIGMOD 2010 Tutorial
Frederick Reiss, Yunyao Li, Laura Chiticariu, and Sriram Raghavan
IBM Almaden Research Center

© 2009 IBM Corporation
Who we are
 Researchers from the Search and Analytics group
at IBM Almaden Research Center
– Frederick Reiss
– Yunyao Li
– Laura Chiticariu
– Sriram Raghavan (virtual)

 Working on information extraction since 2006-08
– SystemT project
– Code shipping with 8 IBM products

Road Map

 What is Information Extraction? (Fred Reiss) (You are here)
 Declarative Information Extraction (Fred Reiss)
 What the Declarative Approach Enables

– Scalable Infrastructure (Yunyao Li)
– Development Support (Laura Chiticariu)
 Conclusion / Q&A (Fred Reiss)

Obligatory “What is Information Extraction?” Slide
 Distill structured data from unstructured and semi-structured text
 Exploit the extracted data in your applications
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…
(from Cohen’s IE tutorial, 2003)

Annotations
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

Bibliography at the end of
the slide deck.

SIGMOD 2006 Tutorial [Doan06] in One Slide
 Information extraction has been an area of study in Natural
Language Processing and AI for years
 Core ideas from database research not a part of existing
work in this area
– Declarative languages
– Well-defined semantics
– Cost-based optimization
 The challenge: Can we build a “System R” for information
extraction?
 Survey of early-stage projects attacking this problem

What’s new?
 New enterprise-focused applications…
 …driving new requirements…
 …leading to declarative approaches

Enterprise Applications of Information Extraction
 Previous tutorial showed research prototypes
– Avatar: Semantic search on personal emails
– DBLife: Use IE to build a knowledge base about
database researchers
– AliBaba: IE over medical research papers

 Since then, IE has gone mainstream
– Enterprise Semantic Search
– Enterprise Data as a Service
– Business Intelligence
– Data-driven Enterprise Mashups

Enterprise Semantic Search
 Use information extraction to improve accuracy and
presentation of search results
Gumshoe (IBM) [Zhu07,Li06]:
– Extract geographical information
– Extract acronyms and their meanings
– Identify pages in different parts of the intranet that are about the same topic

Enterprise Data as a Service
 Extract and clean useful information
hidden in publicly available
documents
 Rent the extracted information
over the Internet

DBLife [1]
Midas (IBM)
(Demo today!)

...<issuer>
<issuerCik>0000070858</issuerCik>
<issuerName>BANK OF AMERICA CORP /DE/</issuerName>
<issuerTradingSymbol>BAC</issuerTradingSymbol>
</issuer>
<reportingOwner>
<reportingOwnerId>
<rptOwnerCik>0001090355</rptOwnerCik>
<rptOwnerName>THAIN JOHN A</rptOwnerName>
</reportingOwnerId>
<reportingOwnerAddress>
<rptOwnerStreet1>C/O GOLDMAN SACHS GROUP</rptOwnerStreet1>
<rptOwnerStreet2>85 BROAD STREET</rptOwnerStreet2>
<rptOwnerCity>NEW YORK</rptOwnerCity>
...
</reportingOwnerAddress>
<reportingOwnerRelationship>
<isOfficer>1</isOfficer>
<officerTitle>Pres Glbl Bkg Sec &amp; Wlth Mgmt</officerTitle>
</reportingOwnerRelationship>
</reportingOwner> ...

Business Intelligence
[Diagram labels: Enterprise Data (emails, call center records, legacy data); Public Data (social networks, blogs, government data); Information Extraction; Data Warehouse; Traditional BI Tools; New BI Tools]

Important applications
 Marketing: Customer sentiment, brand management
 Legal: Electronic legal discovery, identifying product pipeline problems
 Strategy: Important economic events, monitoring competitors

IBM eDiscovery Analyzer

Data-Driven Mashups
 Extract structured
information from
unstructured feeds
 Join extracted information
with other structured
enterprise data

IBM Lotus Notes
Live Text

IBM InfoSphere MashupHub
[Simmen09]

Enterprise Information Extraction
 IE has become increasingly important to emerging enterprise
applications
 Set of requirements driven by enterprise apps that use information
extraction
– Scalability
• Large data volumes, often orders of magnitude larger than classical NLP
corpora

– Accuracy
• Garbage-in garbage-out: Usefulness of application is often tied to quality
of extraction

– Usability
• Building an accurate IE system is labor-intensive
• Professional programmers are much more expensive than grad students!

A Canonical IE System

Text → Feature Selection → Features → Entity Identification → Entities and Relationships → Entity Resolution → Structured Information

 Boundaries between these stages are not clear-cut
 This diagram shows a simplified logical data flow
– Traditionally, the physical data flow is the same as the logical one
– But the systems we’ll talk about take a very different approach to the actual order of execution

Feature Selection
 Identify features
– Very simple, “atomic” entities
– Inputs for other stages
 Examples of features
– Dictionary match
– Regular expression match
– Part of speech
 Typical components used
– Off-the-shelf morphology package
– Many simple rules
 Very time-consuming and underappreciated
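
The feature types listed above can be sketched in a few lines of Python. This is only an illustration: the dictionary contents, the regular expressions, and the function name `select_features` are assumptions made for the example, not part of any system described in this tutorial.

```python
import re

# Hypothetical dictionary of common first names (illustrative only).
COMMON_FIRST_NAMES = {"john", "jack", "laura"}

TOKEN = re.compile(r"\w+")
CAPS_WORD = re.compile(r"\b[A-Z][a-z]+\b")   # regular expression feature
DIGITS = re.compile(r"\b\d+\b")              # regular expression feature


def select_features(text):
    """Return (feature_type, start, end, matched_text) tuples for simple features."""
    features = []
    # Dictionary match: case-insensitive lookup of every token.
    for m in TOKEN.finditer(text):
        if m.group().lower() in COMMON_FIRST_NAMES:
            features.append(("FirstName", m.start(), m.end(), m.group()))
    # Regular expression matches.
    for name, pattern in (("CapsWord", CAPS_WORD), ("Digits", DIGITS)):
        for m in pattern.finditer(text):
            features.append((name, m.start(), m.end(), m.group()))
    # Order by position in the text, then by feature type.
    return sorted(features, key=lambda f: (f[1], f[0]))


features = select_features("Call John Merker at 555-1212.")
```

Each feature keeps its character offsets so that later stages can reason about adjacency and distance in the text.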

Entity Identification
 Use basic features to build more complex features
– Example: …was done by Mr. Jack Gurbingal at the…
Dictionary match (common first name) + regular expression match (capitalized word) = complex feature (potential person name)

 Use other features to determine which of the complex
features are instances of entities and relationships
 Most information extraction research focuses on this stage
– Variety of different techniques
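
As a rough sketch of the rule-based flavor of this stage, the snippet below combines a dictionary feature with an adjacent capitalized-word feature into a candidate person name. The helper names (`span_of`, `identify_persons`) and the hand-built feature list are assumptions for illustration; a real system would receive the features from the previous stage.

```python
def span_of(text, word, label):
    """Build a (label, start, end, text) feature for the first occurrence of word."""
    start = text.index(word)
    return (label, start, start + len(word), word)


def identify_persons(text, features):
    """FirstName immediately followed by CapsWord gives a potential person name."""
    first_names = [f for f in features if f[0] == "FirstName"]
    caps_words = [f for f in features if f[0] == "CapsWord"]
    persons = []
    for _, fs, fe, _ in first_names:
        for _, cs, ce, _ in caps_words:
            # The capitalized word must start after the first name,
            # with only whitespace in between.
            if cs > fe and text[fe:cs].isspace():
                persons.append(("Person", fs, ce, text[fs:ce]))
    return persons


text = "was done by Mr. Jack Gurbingal at the"
features = [
    span_of(text, "Jack", "FirstName"),
    span_of(text, "Mr", "CapsWord"),
    span_of(text, "Jack", "CapsWord"),
    span_of(text, "Gurbingal", "CapsWord"),
]
persons = identify_persons(text, features)
```

The result is still only a "potential" person name; further features would be needed to decide whether it is a real entity.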

Entity Resolution
 Perform complex analyses over entities and
relationships
 Examples
– Identify entities that refer to the same person or thing
– Join extracted information with external structured data

 Not the main focus of this tutorial
– But interacts with other parts of information extraction

Obligatory Person-Phone Example

Call John Merker at 555-1212.
John also has a cell #: 555-1234

Person-Phone Example: Input
Call John Merker at 555-1212.
John also has a cell #: 555-1234

Person-Phone Example: Features
[Figure: the same text with its basic features marked]

Person-Phone Example: Entities and Relationships
[Figure: Entity Identification marks "John Merker" as Person and "555-1212" as Phone in the first sentence, and "John" as Person, "cell" as NumType and "555-1234" as Phone in the second sentence]
[Figure, next build: Entity Resolution links the two Person mentions as the same person and joins the result with the office phone directory]

Road Map
 What is Information Extraction?
 Declarative Information Extraction (You are here)
 What the Declarative Approach Enables

– Scalable Infrastructure (Yunyao Li)
– Development Support (Laura Chiticariu)
 Conclusion / Q&A (Fred Reiss)

Declarative Information Extraction
 Overview of traditional approaches to information
extraction
 Practical issues with applying traditional
approaches
 How recent work has used declarative approaches
to address these issues
 Different types of declarative approaches

Traditional Approaches to Information Extraction
 Two dominant types:
– Rule-Based
– Machine Learning-Based

 Distinction is based on how Entity Identification is
performed
[Diagram: Text → Feature Selection → Features → Entity Identification → Entities and Relationships → Entity Resolution → Structured Information]
Anatomy of a Rule-Based System
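[Diagram: Example Documents inform the Feature Selection Rules and the Entity Identification Rules; these rules drive the Feature Selection and Entity Identification stages of the pipeline Text → Feature Selection → Features → Entity Identification → Entities, Rels. → Entity Resolution → Structured Information]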
Anatomy of a Machine Learning-Based System
[Diagram: Example Documents inform the Feature Selection Rules; Labeled Documents pass through Feature Selection to yield Features and Labels, which Training turns into a Model used by Entity Identification; the pipeline itself is Text → Feature Selection → Features → Entity Identification → Entities, Rels. → Entity Resolution → Structured Information]
A Brief History of IE in the NLP Community
Rule-Based
 1978-1997: MUC (Message
Understanding Conference) –
DARPA competition 1987 to 1997
– FRUMP [DeJong82]
– FASTUS [Appelt93],
– TextPro, PROTEUS
 1998: Common Pattern
Specification Language (CPSL)
standard [Appelt98]
– Standard for subsequent rule-based systems
 1999-2010: Commercial products,
GATE

Machine Learning
 At first: Simple techniques like
Naive Bayes
 1990’s: Learning Rules
– AUTOSLOG [Riloff93]
– CRYSTAL [Soderland98]
– SRV [Freitag98]
 2000’s: More specialized models
– Hidden Markov Models [Leek97]
– Maximum Entropy Markov
Models [McCallum00]
– Conditional Random Fields
[Lafferty01]
– Automatic feature expansion

For further reading:
Sunita Sarawagi’s Survey [Sarawagi08], Claire Cardie’s Survey [Cardie97]

Tying the System Together: Traditional IE Frameworks
 Traditional approach:
Workflow system
– Sequence of discrete steps
– Data only flows forward
 GATE¹ and UIMA² are the most popular frameworks
– Type systems and standard
data formats
 Web services and Hadoop also
in common use
– No standard data format
[Figure: Workflow for the ANNIE system [Cunningham09]]

1. GATE (General Architecture for Text Engineering) official web site: http://gate.ac.uk/
2. Apache UIMA (Unstructured Information Management Architecture) official web site: http://uima.apache.org/

Sequential Execution in CPSL Rules

[Figure: a cascade of CPSL grammar levels applied to a passage of filler text; each level rewrites the annotations produced by the level below it]

Level 0 (Feature Selection)
… <FirstName> <CapsWord> at <Digits>-<Digits> …

Level 1
〈FirstName〉 〈CapsWord〉 → 〈Person〉
… <Person> at <Digits>-<Digits> …
〈Digits〉 〈Token〉[~ “-”] 〈Digits〉 → 〈Phone〉
… <FirstName> <CapsWord> at <Phone> …
Result of Level 1: … <Person> at <Phone> …

Level 2
〈Person〉 〈Token〉[~ “at”] 〈Phone〉 → 〈PersonPhone〉
… <PersonPhone> …

Problems with Traditional IE Approaches
 Complex, fixed pipelines and rule sets
 Semantics tied to order of execution
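
To make the order dependence concrete, here is a minimal sketch of a level-by-level rule cascade in the spirit of CPSL. It is not CPSL itself: the data representation (a list of `(type, text)` annotations) and the names `apply_rule` and `run_cascade` are assumptions chosen for the example.

```python
def is_type(name):
    return lambda ann: ann[0] == name


def is_token(text):
    return lambda ann: ann == ("Token", text)


def apply_rule(seq, pattern, out_type):
    """Scan left to right and replace each match of pattern with one annotation."""
    result, i, n = [], 0, len(pattern)
    while i < len(seq):
        window = seq[i:i + n]
        if len(window) == n and all(pred(ann) for pred, ann in zip(pattern, window)):
            # The new annotation covers the text of everything it consumed.
            result.append((out_type, " ".join(text for _, text in window)))
            i += n
        else:
            result.append(seq[i])
            i += 1
    return result


LEVEL_1 = [
    ([is_type("FirstName"), is_type("CapsWord")], "Person"),
    ([is_type("Digits"), is_token("-"), is_type("Digits")], "Phone"),
]
LEVEL_2 = [
    ([is_type("Person"), is_token("at"), is_type("Phone")], "PersonPhone"),
]


def run_cascade(seq, levels):
    """Apply every rule of every level, strictly in the given order."""
    for level in levels:
        for pattern, out_type in level:
            seq = apply_rule(seq, pattern, out_type)
    return seq


level_0 = [("Token", "Call"), ("FirstName", "John"), ("CapsWord", "Merker"),
           ("Token", "at"), ("Digits", "555"), ("Token", "-"), ("Digits", "1212")]

in_order = run_cascade(level_0, [LEVEL_1, LEVEL_2])
reversed_order = run_cascade(level_0, [LEVEL_2, LEVEL_1])
```

Running the same rules with the levels swapped produces no PersonPhone annotation at all, which is one way to see why the meaning of such a rule set cannot be separated from its execution order.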

Scalability: Data only flows forward, leading to wasted work in early stages.
Accuracy: Lots of custom procedural code.
Usability: Hard to understand why the system produces a particular result.

Declarative to the Rescue!
 Define the logical constraints
between rules/components
 System determines order of
execution

Scalability: Optimizer avoids wasted work
Accuracy: More expressive rule languages; combine different tools easily
Usability: Describe what to extract, instead of how to extract it

What do we mean by “declarative”?
 Common vision:
– Separate semantics from order of execution
– Build the system around a language like SQL or Datalog
 Different systems have different interpretations
 Three main categories
– High-Level Declarative
• Most common approach

– Completely Declarative
– Mixed Declarative

High-Level Declarative
 Replace the overall IE framework with a declarative language
 Each individual extraction component is still a “black box”
 Example 1: SQoUT [Jain08]
[Diagram: an SQL query and a Catalog of Extraction Modules feed the Optimizer; the query plan combines extraction modules with scan and index access to data.]
 Example 2: PSOX [Bohannon08]
 Advantages:
– Allows use of many existing “black box” packages
– High-level performance optimizations possible
– Clear semantics for using different packages for the same task
 Drawbacks:
– Doesn’t address issues that occur within a given “black box”
– Limited opportunities for optimization, unless “black boxes” can
provide hints

Completely Declarative
 One declarative language covers all stages of extraction
 Example 1: AQL language in SystemT [Chiticariu10]

Feature Selection:
-- Find all matches
-- of a dictionary
create view Name as
extract dictionary CommonFirstName
    on D.text as name
from Document D;

Entity Identification:
-- Match people with their
-- phone numbers
create view PersonPhone as
select P.name as person,
       N.num as phone
from Person P, PhoneNum N
where …

Entity Resolution:
-- Find pairs of references
-- to the same person
create view SamePerson as
select P1.name as name1,
       P2.name as name2
from Person P1, Person P2
where …

Declarative Semantics Example:
Identifying Musician-Instrument Relationships
Instrument: dictionary (pipe | guitar | hammond organ |…)
Person: (Person Annotator)
PersonPlaysInstrument: 〈Person〉 〈0-5 tokens〉 〈Instrument〉

Input: John Pipe plays the guitar
[Figure: "Pipe" matches both Person and Instrument, so the input can be read in several ways]
– 〈Person〉 〈Person〉 〈Token〉 〈Token〉 〈Instrument〉 (John | Pipe | plays | the | guitar)
– 〈Person〉 〈Instrument〉 〈Token〉 〈Token〉 〈Instrument〉 (John | Pipe | plays | the | guitar)
– 〈Person〉 〈Token〉 〈Token〉 〈Instrument〉 (John Pipe | plays | the | guitar)

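
One way to picture declarative semantics for this example is as a join over sets of candidate annotations, where overlapping candidates coexist instead of one reading being fixed by execution order. The sketch below assumes hypothetical annotator outputs (token ranges for Person and Instrument) and a helper named `person_plays_instrument`; it is not AQL and ignores any consolidation step a real system might apply afterwards.

```python
TOKENS = ["John", "Pipe", "plays", "the", "guitar"]

# Hypothetical annotator outputs as (start_token, end_token) ranges, end exclusive.
PERSONS = [(0, 1), (1, 2), (0, 2)]     # "John", "Pipe", "John Pipe"
INSTRUMENTS = [(1, 2), (4, 5)]         # "Pipe", "guitar"


def text_of(span):
    return " ".join(TOKENS[span[0]:span[1]])


def person_plays_instrument(persons, instruments, max_gap=5):
    """All (person, instrument) pairs with 0 to max_gap tokens between them."""
    return {
        (text_of(p), text_of(i))
        for p in persons
        for i in instruments
        if 0 <= i[0] - p[1] <= max_gap
    }


matches = person_plays_instrument(PERSONS, INSTRUMENTS)
```

Because the result is defined as a set over all candidates, it does not depend on which annotator happened to run first; choosing among overlapping matches becomes a separate, explicit step.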
Completely Declarative
 One declarative language covers all stages of extraction
 Example 1: AQL language in SystemT [Chiticariu10]
 Example 2: Conditional Random Fields in SQL [Wang10]
 Advantages:
– Unified language → clear semantics from top to bottom
– Optimizer has full control over low-level operations
– Can incorporate existing packages using user-defined
functions
 Drawbacks:
– Code inside UDFs doesn’t benefit from declarativeness

Mixed Declarative
 Language provides declarativeness at the level of some, but
not all, of the extraction operations, both at the individual
and pipeline level
 Example: Xlog (CIMPLE) [Shen07]
[Figure: extraction program for talk extracts, from [1]]
– One Datalog predicate in the program represents a large, opaque block of extraction code.
– Another predicate is defined in Datalog, using low-level operations.
 Advantages:
– Ability to reuse existing “black box” packages
– Optimizer gets some flexibility to reorder low-level operations
 Drawbacks:
– Challenging to build an optimizer that does both “high-level”
and “low-level” optimizations

Declarative to the Rescue!
 Different notions of declarativeness in
different systems
 All kinds address the major issues in
enterprise IE, but in different ways

Scalability: Optimizer avoids wasted work
Accuracy: More expressive rule languages; combine different tools easily
Usability: Describe what to extract, instead of how to extract it

Road Map
 What is Information Extraction? (Fred Reiss)
 Declarative Information Extraction (Fred Reiss)
 What the Declarative Approach Enables

– Scalable Infrastructure (Yunyao Li) (You are here)
– Development Support (Laura Chiticariu)
 Conclusion/Questions

Scalable Infrastructure

Yunyao Li
IBM Almaden Research Center

Conventional vs. Declarative IE Infrastructure
 Conventional:
– Operational semantics
and implementation are
hard-coded and
interconnected

 Declarative:
– Separate semantics from
implementation.
– Database-style design:
Optimizer + Runtime
[Diagram: Conventional: Extraction Pipeline → Runtime Environment. Declarative: Declarative Language → Optimizer → Plan → Runtime Environment.]

Why Declarative IE for Scalability
 An informal experimental
study [Reiss08]
– Collection of 4.5 million
web logs
– Band Review Annotator:
identify informal reviews
of concerts

[Chart: the declarative implementation was 20x faster than the CPSL-based implementation]
Different Aspects of Design for Scalability
 Optimization
– Granularity
• High-level: annotator composition
• Low-level: basic extraction operators

– Strategy:
• Rewrite-based
• Cost-based

 Runtime Model
– Document-Centric vs. Collection-Centric

Optimization Granularity for Declarative IE
 Annotator Composition
– Each annotator extracts one or more entities or relationships
• E.g. Person annotator
– Black box assumption on how an annotator works
– Optimizing composition of extraction pipeline

 Basic Extraction Operator
– Each operator represents an atomic extraction operation
• E.g. dictionary matching, regular expression, join,…
– System is fully aware of how each extraction operator works
– Optimizing each basic extraction operator

[Spectrum shown beneath the two columns: High-level declarative | Mixed declarative | Completely declarative]
Optimization Strategies for Declarative IE
 Rewrite-based
– Apply rewrite rules to transform the declarative form of the annotators into an equivalent form that is more efficient

 Cost-Based
– Enumerate all possible physical execution plans, estimate their cost, and choose the one with the minimum expected cost

Systems may mix these two approaches
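
A toy version of the cost-based strategy can be sketched as follows. Everything numeric here is made up for illustration: the operator names, per-document costs, and selectivities are assumptions, and the model further assumes the operators are independent filters that can be applied in any order.

```python
from itertools import permutations

# Hypothetical operators: name -> (cost per document, fraction of documents kept).
OPERATORS = {
    "regex_phone": (5.0, 0.1),
    "dict_first_name": (1.0, 0.5),
    "pos_tagger": (50.0, 0.9),
}


def plan_cost(order, n_docs=1000):
    """Estimated cost of applying the operators as successive filters."""
    cost, remaining = 0.0, float(n_docs)
    for name in order:
        per_doc, selectivity = OPERATORS[name]
        cost += per_doc * remaining      # pay for every document still in play
        remaining *= selectivity         # fewer documents reach the next operator
    return cost


def choose_plan():
    """Enumerate every ordering and keep the one with minimum estimated cost."""
    return min(permutations(OPERATORS), key=plan_cost)


best = choose_plan()
```

Exhaustive enumeration is only feasible for a handful of operators; practical optimizers prune the search space, which is where rewrite rules and cost estimates tend to be combined.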

Runtime Model for Declarative IE
 Document-Centric

 Collection-Centric
[Diagram, document-centric: Input Document Stream → Runtime Environment → Annotated Document Stream]
[Diagram, collection-centric: Document Collection + Auxiliary index → Runtime Environment → Annotations]
Systems
 CIMPLE
 RAD
 SQoUT
 SystemT
 BayesStore

Cimple
 Rewrite-based optimization

[Shen07]

– Inverted-index based simple pattern matching
• Shared document scan

Simple patterns:
P1 = “(Jeff|Jeffery)\s\s*Ullman”
P2 = “(Jeff|Jeffery)\s\s*Naughton”
P3 = “Laura\s\s*Haas”
P4 = “Peter\s\s*Haas”

[Figure: parse trees of AND/OR nodes for patterns p1 through p4, and an inverted index from pattern fragments (Jeff\s, Jeffery\s, \s*, Ullman, Naughton, Laura\s, Peter\s, Haas) to the patterns P1 through P4 that contain them]

Cimple
 Pushing down text properties

[Shen07]

– Eg: To find an all-capitalized line
Plan a: σallcaps(x) over lines(d,x,n)
Plan b: σallcaps(x) over lines(d,x,n) over σcontainCaps(d)

 Scoping [Shen07]

– Imposing location conditions on where to extract spans
• Eg: Check for names only within two lines of the occurrence of titles
Incorporating cost-model to decide how to apply the rewrite.
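
The difference between Plan a and Plan b above can be sketched in Python. The function names mirror the predicates on the slide, but their definitions here are assumptions for illustration (for instance, `allcaps` is taken to mean "contains letters and all of them are upper case").

```python
def lines(doc):
    return doc.splitlines()


def allcaps(line):
    letters = [c for c in line if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def contain_caps(doc):
    """Cheap document-level test: is there any upper-case letter at all?"""
    return any(c.isupper() for c in doc)


def plan_a(docs):
    # Split every document into lines, then filter the lines.
    return [(i, ln) for i, d in enumerate(docs) for ln in lines(d) if allcaps(ln)]


def plan_b(docs):
    # Push the text property down: skip documents that cannot contain a match.
    return [(i, ln) for i, d in enumerate(docs) if contain_caps(d)
            for ln in lines(d) if allcaps(ln)]


docs = ["intro\nTITLE LINE\nbody", "all lowercase\ntext", "ABSTRACT\nmore"]
result_a = plan_a(docs)
result_b = plan_b(docs)
```

Both plans return the same lines; Plan b simply avoids splitting documents that have no capital letters. Whether the extra document-level check pays off depends on how many documents it eliminates, which is the kind of decision a cost model can make.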

Cimple
 Collection-centric runtime model
– Document collection (or snapshots of document collection)
– Previous extraction results

 Reusing previous extraction results

[Chen08][Chen09]

• Similar to maintaining materialized views
• Cyclex: IE program viewed as
one big blackbox [Chen08]
• Delex: IE program viewed as a
workflow of blackboxes [Chen09]

RAD [Khaitan09]
 Query language: a declarative subset of CPSL specification
– Regular expressions over features and existing annotations
[Diagram, offline process: the Document Collection passes through tokenization, chunking and sentence detection to build the Document Inverted index]
Generating indexed features
• Dictionary lookup (E.g. First name)
• Part of speech lookup (E.g. Noun, verb)
• Regular expression on tokens (E.g. CapsWord, Alphanum)
[Diagram, query processing: the Query goes to the Optimizer, which generates derived entities (E.g. Person, Organization) over the index using a series of join operators, yielding the Document Inverted index + Annotations]

RAD
 Cost-based Optimization based on Posting-list
Statistics
• E.g. ANYWORD@ANYWORD.com for Email
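
A simplified sketch of a zig-zag join over sorted posting lists is shown below. It intersects document ids only, whereas a pattern such as the Email example would also need token positions; the posting lists and the helper names `zigzag_join` and `intersect_all` are assumptions for the example. Joining the shortest lists first is a very crude stand-in for a cost-based choice driven by posting-list statistics.

```python
from bisect import bisect_left
from functools import reduce


def zigzag_join(a, b):
    """Intersect two sorted posting lists, leaping ahead with binary search."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = bisect_left(a, b[j], i + 1)   # skip entries of a below b[j]
        else:
            j = bisect_left(b, a[i], j + 1)   # skip entries of b below a[i]
    return out


def intersect_all(posting_lists):
    """Join the shortest lists first so intermediate results stay small."""
    ordered = sorted(posting_lists, key=len)
    return reduce(zigzag_join, ordered)


# Hypothetical posting lists (sorted document ids) for three index terms.
postings = {
    "@": [2, 5, 8, 13, 21, 34],
    ".": [1, 2, 3, 5, 8, 9, 13, 20, 21],
    "com": [5, 13, 40],
}
docs_with_all = intersect_all(list(postings.values()))
```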

[Figure: two alternative plans, Plan a and Plan b, for matching ANYWORD @ ANYWORD . com; each plan is a sequence of zig-zag joins over the inverted index that combines the posting lists in a different order through intermediate results (R1 to R3 in Plan a, R1 to R4 in Plan b)]
RAD
 Rewrite-based Optimization
– Share sub-expression evaluation
• Evaluate the same sub-expression only once

SQoUT [Ipeirotis07][Jain07,08,09]
 Focus on composition of extraction systems
[Diagram: an SQL query names the entities/relations to extract; an Extraction System Repository holds extraction systems E0 … Em, each paired with a Retrieval Strategy over the Document Collection; extraction results pass through Data Cleaning into an Extracted View, from which the query results are produced]
SQoUT
 Cost-based Query Optimization
 New Plan Enumeration Strategies
– Document retrieval strategies
• Eg: filtered scan
– Running the annotator only over potentially relevant docs

– Join execution
• Independent join, outer/inner join, zig-zag join:
– Extraction results of one relation can determine the docs retrieved for
another relation.

 Efficiency vs. Quality Cost Model

[Figure: Goodness is computed from Quality and Efficiency, balanced by a Weight]
SystemT [Reiss08] [Krishnamurthy08] [Chiticariu10]
[Diagram: Rules → Preprocessor → Blocks → Planner (Plan Enumerator, Cost Model) → Block Plans → Postprocessor → Final Plan]

Preprocessor
• Divide rules into compilation blocks.
• Rewrite-based optimization within each block

Planner
• System R style cost-based optimization within each block.

Postprocessor
• Merge block plans into a single operator graph.
• Rewrite-based optimization across blocks.

Example: Restricted Span Evaluation (RSE)
 Leverage the sequential nature
of text
– Join predicates on character
or token distance
 Only evaluate the inner on the
relevant portions of the
document
 Limited applicability
– Need to guarantee exact
same results

Only look for dictionary
matches in the vicinity of a
phone number.
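
The idea can be sketched by comparing a naive join with a restricted evaluation that scans for dictionary words only in a window before each phone match. The regular expressions, the dictionary, the window size, and the function names are all assumptions for illustration; this is not how SystemT implements RSE.

```python
import re

PHONE = re.compile(r"\b\d{3}-\d{4}\b")
WORD = re.compile(r"\b\w+\b")
FIRST_NAMES = {"John", "Laura"}   # hypothetical dictionary


def naive_join(text, window):
    """Evaluate both inputs over the whole document, then apply the distance predicate."""
    names = [m for m in WORD.finditer(text) if m.group() in FIRST_NAMES]
    phones = list(PHONE.finditer(text))
    return [(n.group(), p.group()) for p in phones for n in names
            if p.start() - window <= n.start() and n.end() <= p.start()]


def rse_join(text, window):
    """Evaluate the dictionary only on the characters just before each phone number."""
    pairs = []
    for p in PHONE.finditer(text):
        lo = max(0, p.start() - window)
        # pos/endpos restrict the scan without copying the text.
        for n in WORD.finditer(text, lo, p.start()):
            if n.group() in FIRST_NAMES:
                pairs.append((n.group(), p.group()))
    return pairs


text = ("John Doe wrote a long report about many unrelated things. "
        "Call John Smith at 555-1212.")
restricted = rse_join(text, 20)
full = naive_join(text, 20)
```

The first "John" is never examined by the restricted version because it lies outside the window, yet both versions return the same pairs, which is the guarantee the slide asks for.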
[Figure: for the text “…John Smith at 555-1212…”, an RSEJoin operator takes phone matches (555-1212) from a Regex operator and evaluates the Dictionary operator (John Smith) only near each match]

Example: Shared Dictionary Matching (SDM)
 Rewrite-based optimization
– Applied to the algebraic plan during postprocessing
 Evaluate multiple dictionaries in a single pass
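
A minimal sketch of the single-pass idea: merge the dictionaries into one lookup table, tokenize the text once, and route each hit back to the dictionaries that contain it. The dictionary names D1 and D2 come from the slide; their contents, the single-token restriction, and the function names are assumptions for the example.

```python
import re
from collections import defaultdict


def build_shared_dictionary(dictionaries):
    """Map each (lower-cased) entry to the set of dictionaries containing it."""
    shared = defaultdict(set)
    for dict_name, entries in dictionaries.items():
        for entry in entries:
            shared[entry.lower()].add(dict_name)
    return shared


def sdm_match(text, shared):
    """Tokenize once and report (start, end) matches per dictionary."""
    matches = defaultdict(list)
    for m in re.finditer(r"\w+", text):          # the single pass over the text
        for dict_name in shared.get(m.group().lower(), ()):
            matches[dict_name].append((m.start(), m.end()))
    return dict(matches)


dictionaries = {"D1": ["John", "Laura"], "D2": ["guitar", "pipe"]}
shared = build_shared_dictionary(dictionaries)
result = sdm_match("John plays the guitar", shared)
```

The saving comes from tokenizing and probing once instead of once per dictionary; each downstream subplan still receives only the matches for its own dictionary.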

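[Figure: two separate Dict operators, one per dictionary (D1, D2), each feeding its own subplan, are replaced by a single SDM Dictionary Operator (SDMDict) that evaluates D1 and D2 together and feeds both subplans]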
SystemT
 Document-centric Runtime Model:
– One document at a time
– Entities extracted are associated with their source document
[Figure: Input Document Stream → Runtime Environment → Annotated Document Stream]
Why one document at a time?
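A minimal Python sketch of a document-at-a-time runtime, assuming documents are dictionaries with `id` and `text` fields and that the only annotator is a toy phone-number regex. The name `annotate_stream` and the record layout are hypothetical, not the SystemT API.

```python
import re
from typing import Dict, Iterable, Iterator

# Toy annotator (assumption): three digits, a hyphen, four digits.
PHONE = re.compile(r"\b\d{3}-\d{4}\b")


def annotate_stream(documents: Iterable[Dict[str, str]]) -> Iterator[Dict[str, object]]:
    """Consume an input document stream and yield annotated documents.

    Each document is processed on its own, so no state is shared between
    documents and every extracted span stays attached to its source."""
    for doc in documents:
        spans = [m.span() for m in PHONE.finditer(doc["text"])]
        yield {"id": doc["id"], "text": doc["text"], "phones": spans}


ANNOTATED = list(
    annotate_stream(
        [
            {"id": "d1", "text": "John Smith at 555-1212"},
            {"id": "d2", "text": "No phone number here"},
        ]
    )
)
```

Since the function keeps no cross-document state, the same code can run over one email in a client or over separate partitions of a large collection.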
Scaling SystemT: From Laptop to Cluster
In Lotus Notes Live Text
[Figure: inside the Lotus Notes Client, an Email Message passes through the SystemT Runtime and the client displays the annotated email]
In Cognos Toro Text Analytics
[Figure: on a Hadoop Cluster, the Jaql Runtime runs over Hadoop Map-Reduce; Documents are distributed to several Jaql Function Wrappers, each containing Input Adapter → SystemT Runtime → Output Adapter]
BayesStore [Wang10]
 Probabilistic declarative IE
– In-database machine learning for efficiency and scalability
 Text Data and Conditional Random Fields (CRF) Model
[Figure: a document is stored as a Token table; the CRF model is stored as a Factor table]
BayesStore
 Viterbi Inference SQL Implementation
– Implementing dynamic programming algorithm using recursive queries
– Rewrite-based optimization
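For readers unfamiliar with Viterbi inference, here is the dynamic program written in plain Python rather than as recursive SQL. It is a sketch under stated assumptions: scores are additive (log-space), the `emission` and `transition` dictionaries stand in for rows of a factor table, and missing entries score 0.0. None of the names come from BayesStore.

```python
from typing import Dict, List, Optional, Tuple


def viterbi(
    tokens: List[str],
    labels: List[str],
    emission: Dict[Tuple[str, str], float],
    transition: Dict[Tuple[str, str], float],
) -> List[str]:
    """Return the highest-scoring label sequence for the tokens."""
    if not tokens:
        return []
    # best[i][y] = (score of the best labeling of tokens[0..i] ending in y,
    #               label chosen for token i-1 on that best path)
    best: List[Dict[str, Tuple[float, Optional[str]]]] = [
        {y: (emission.get((tokens[0], y), 0.0), None) for y in labels}
    ]
    for i in range(1, len(tokens)):
        column: Dict[str, Tuple[float, Optional[str]]] = {}
        for y in labels:
            prev_label, score = max(
                ((p, best[i - 1][p][0] + transition.get((p, y), 0.0)) for p in labels),
                key=lambda pair: pair[1],
            )
            column[y] = (score + emission.get((tokens[i], y), 0.0), prev_label)
        best.append(column)
    # Backtrack from the best final label.
    path = [max(labels, key=lambda y: best[-1][y][0])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


PATH = viterbi(
    ["Bill", "Gates", "spoke"],
    ["PER", "O"],
    {("Bill", "PER"): 2.0, ("Gates", "PER"): 1.5, ("spoke", "O"): 2.0},
    {("PER", "PER"): 0.5, ("O", "O"): 0.5},
)
```

Each column of `best` depends only on the previous column, which is the recurrence that a recursive query can express one token position at a time.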
Summary
[Table: design choices of the systems Cimple, RAD, SQoUT, BayesStore, and SystemT along three dimensions:
Optimization Granularity (basic operator, annotator composition);
Optimization Strategy (rewrite-based, cost-based);
Runtime Model (document level, collection level)]
Road Map
 What is Information Extraction? (Fred Reiss)
 Declarative Information Extraction (Fred Reiss)
 What the Declarative Approach Enables
– Scalable Infrastructure (Yunyao Li)
– Development Support (Laura Chiticariu) (you are here)
Development Support (Tooling)

Laura Chiticariu
IBM Almaden Research Center

Declarative to the Rescue!
 Define the logical constraints between rules/components
 System determines order of execution
Scalability: Optimizer avoids wasted work
Accuracy: More expressive rule languages; combine different tools easily
Usability: Describe what to extract, instead of how to extract it
A Canonical IE System
[Figure: Text → Feature Selection → Features → Entity Identification → Entities and Relationships → Entity Resolution → Structured Information]
Developing IE systems is an extremely time-consuming, error-prone process
The Life Cycle of an IE System
[Figure: two phases. Development: the Developer works on (1) features and (2) rules / labeled data in a cycle of Develop, Test, and Analyze. Usage / Maintenance: the User works in a cycle of Use, Test, and Refine.]
Example 1: Explaining Extraction Results
-----------------------------------------
-- Document Preprocessing
-----------------------------------------
create view Doc as
select D.text as text
from DocScan D;

-----------------------------------------
-- Basic Named Entity Annotators
-----------------------------------------

-- Find initial words
create view InitialWord1 as
select R.match as word
--from Regex(/\b([\p{Upper}]\.\s*){1,5}\b/, Doc.text) R
from RegexTok(/([\p{Upper}]\.\s*){1,5}/, 10, Doc.text) R
-- added on 04/18/2008
where Not(MatchesRegex(/M\.D\./, R.match));

-- Yunyao: added on 11/21/2008 to capture names with prefix (we use it as initial
-- to avoid adding too many complex rules)
create view InitialWord2 as
select D.match as word
from Dictionary('specialNamePrefix.dict', Doc.text) D;

create view InitialWord as
(select I.word as word from InitialWord1 I)
union all
(select I.word as word from InitialWord2 I);

-- Find weak initial words
create view WeakInitialWord as
select R.match as word
--from Regex(/\b([\p{Upper}]\.?\s*){1,5}\b/, Doc.text) R;
from RegexTok(/([\p{Upper}]\.?\s*){1,5}/, 10, Doc.text) R
-- added on 05/12/2008
-- Do not allow weak initial word to be a word longer than three characters
where Not(ContainsRegex(/[\p{Upper}]{3}/, R.match))
-- added on 04/14/2009
-- Do not allow weak initial words to match the time zone
and Not(ContainsDict('timeZone.dict', R.match));

-----------------------------------------------
-- Strong Phone Numbers
-----------------------------------------------
create dictionary StrongPhoneVariantDictionary as (
'phone',
'cell',
'contact',
'direct',
'office',
-- Yunyao: Added new strong clues for phone numbers
'tel',
'dial',
'Telefon',
'mobile',
'Ph',
'Phone Number',
'Direct Line',
'Telephone No',
'TTY',
'Toll Free',
'Toll-free',
-- German
'Fon',
'Telefon Geschaeftsstelle',
'Telefon Geschäftsstelle',
'Telefon Zweigstelle',
'Telefon Hauptsitz',
'Telefon (Geschaeftsstelle)',
'Telefon (Geschäftsstelle)',
'Telefon (Zweigstelle)',
'Telefon (Hauptsitz)',
'Telefonnummer',
'Telefon Geschaeftssitz',
'Telefon Geschäftssitz',
'Telefon (Geschaeftssitz)',
'Telefon (Geschäftssitz)',
'Telefon Persönlich',
'Telefon persoenlich',
'Telefon (Persönlich)',
'Telefon (persoenlich)',
'Handy',
'Handy-Nummer',
'Telefon arbeit',
'Telefon (arbeit)'
);

create view Initial as
--'Junior' (Yunyao: comments out to avoid mismatches such as Junior National [team player],
-- If we can have large negative dictionary to eliminate such mismatches,
-- then this may be recovered
--'Name:' ((Yunyao: comments out to avoid mismatches such as 'Name: Last Name')
-- for German names
-- TODO: need further test
,'herr', 'Fraeulein', 'Doktor', 'Herr Doktor', 'Frau Doktor',
'Herr Professor', 'Frau professor', 'Baron', 'graf'

-- Find dictionary matches for all title initials

create view LastName as
select C.lastname as lastname
--from Consolidate(ValidLastNameAll.lastname) C;
from ValidLastNameAll C
consolidate on C.lastname;

select D.match as initial
--'Name:' ((Yunyao: comments out to avoid mismatches such as 'Name: Last Name')
-- for German names
-- TODO: need further test
,'herr', 'Fraeulein', 'Doktor', 'Herr Doktor', 'Frau Doktor',
'Herr Professor', 'Frau professor', 'Baron', 'graf'
);

-- Find dictionary matches for all first names
-- Mostly US first names
create view StrictFirstName1 as
select D.match as firstname
from Dictionary('strictFirst.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/,
D.match);
-- changed to enable unicode match
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);

);

-- German first names
create view StrictFirstName2 as
select D.match as firstname
from Dictionary('strictFirst_german.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/,
D.match);
--where MatchesRegex(/p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);

-- Find dictionary matches for all title initials
from Dictionary('InitialDict', Doc.text) D;
-- Yunyao: added 05/09/2008 to capture person name suffix
create dictionary PersonSuffixDict as
(
',jr.', ',jr', 'III', 'IV', 'V', 'VI'
);
create view PersonSuffix as
select D.match as suffix
from Dictionary('PersonSuffixDict', Doc.text) D;
-- Find capitalized words that look like person names and not in the non-name dictionary
create view CapsPersonCandidate as
select R.match as name
--from Regex(/bp{Upper}p{Lower}[p{Alpha}]{1,20}b/, Doc.text) R
--from Regex(/bp{Upper}p{Lower}[p{Alpha}]{0,10}(['-][p{Upper}])?[p{Alpha}]{1,10}b/, Doc.text) R
-- change to enable unicode match
--from Regex(/bp{Lu}p{M}*[p{Ll}p{Lo}]p{M}*[p{L}p{M}*]{0,10}(['-][p{Lu}p{M}*])?[p{L}p{M}*]{1,10}b/, Doc.text) R
--from Regex(/bp{Lu}p{M}*[p{Ll}p{Lo}]p{M}*[p{L}p{M}*]{0,10}(['-][p{Lu}p{M}*])?(p{L}p{M}*){1,10}b/, Doc.text) R
-- Allow fully capitalized words
--from Regex(/bp{Lu}p{M}*(p{L}p{M}*){0,10}(['-][p{Lu}p{M}*])?(p{L}p{M}*){1,10}b/, Doc.text) R
from RegexTok(/p{Lu}p{M}*(p{L}p{M}*){0,10}(['-][p{Lu}p{M}*])?(p{L}p{M}*){1,10}/, 4, Doc.text) R --'
where Not(ContainsDicts(
'FilterPersonDict',
'filterPerson_position.dict',
'filterPerson_german.dict',
'InitialDict',
'StrongPhoneVariantDictionary',
'stateList.dict',
'organization_suffix.dict',
'industryType_suffix.dict',
'streetSuffix_forPerson.dict',
'wkday.dict',
'nationality.dict',
'stateListAbbrev.dict',
'stateAbbrv.ChicagoAPStyle.dict', R.match));
create view CapsPerson as
select C.name as name
from CapsPersonCandidate C
where Not(MatchesRegex(/(p{Lu}p{M}*)+-.*([p{Ll}p{Lo}]p{M}*).*/, C.name))
and Not(MatchesRegex(/.*([p{Ll}p{Lo}]p{M}*).*-(p{Lu}p{M}*)+/, C.name));

create view CapsPersonNoP as
select CP.name as name
from CapsPerson CP
where Not(ContainsRegex(/'/, CP.name)); --'

create dictionary InitialDict as
( 'Pro','Bono','Enterprises','Group','Said','Says','Assistant','Vice','Warden','Contribution',
'rev.', 'col.', 'reverend', 'prof.', 'professor.',
'lady', 'miss.', 'mrs.', 'mrs', 'mr.', 'pt.', 'ms.', 'Sales',
'Research', 'Development', 'Product',
'messrs.', 'dr.', 'master.', 'marquis', 'monsieur',
'Support', 'Manager', 'Telephone', 'Phone', 'Contact',
'ds', 'di'
'Information',
--'Dear' (Yunyao: comments out to avoid mismatches such as
'Electronics','Managed','West','East','North','South',
Dear Member),
'Teaches','Ministry', 'Church', avoid mismatches such
--'Junior' (Yunyao: comments out to'Association',
as'Laboratories', [team player],
Junior National 'Living', 'Community', 'Visiting',
-- 'Officer', have large negative'Only', 'Additionally', such
If we can 'After', 'Pls', 'FYI', dictionary to eliminate
mismatches, 'Acquire', 'Addition', 'America',
'Adding',
-- then this phrases that are likely to be at the start of a
-- short may be recovered

sentence
'Yes', 'No', 'Ja', 'Nein','Kein', 'Keine', 'Gegenstimme',
-- TODO: to be double checked
'Another', 'Anyway','Associate', 'At', 'Athletes', 'It',
'Enron', 'EnronXGate', 'Have', 'However',
'Company', 'Companies', 'IBM','Annual',
-- common verbs appear with person names in
financial reports
-- ideally we want to have a general comprehensive
verb list to use as a filter dictionary
'Joins', 'Downgrades', 'Upgrades', 'Reports', 'Sees',
'Warns', 'Announces', 'Reviews'
-- Laura 06/02/2009: new filter dict for title for SEC
domain in filterPerson_title.dict
);
create dictionary GreetingsDict as
(
'Hey', 'Hi', 'Hello', 'Dear',
-- German greetings
'Liebe', 'Lieber', 'Herr', 'Frau', 'Hallo',
-- Italian
'Ciao',
-- Spanish
'Hola',
-- French
'Bonjour'
);


create dictionary InitialDict as
(
'rev.', 'col.', 'reverend', 'prof.', 'professor.',
'lady', 'miss.', 'mrs.', 'mrs', 'mr.', 'pt.', 'ms.',
'messrs.', 'dr.', 'master.', 'marquis', 'monsieur',
'ds', 'di'
--'Dear' (Yunyao: comments out to avoid
mismatches such as Dear Member),

-- Spain first name from blue pages
create view StrictFirstName7 as
select D.match as firstname
from Dictionary('names/strictFirst_spain.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);

--============================================================
-- Find strict capitalized words
--create view StrictCapsPerson as
create view StrictCapsPerson as
select R.name as name
from StrictCapsPersonR R
where MatchesRegex(/bp{Lu}p{M}*[p{Ll}p{Lo}]p{M}*(p{L}p{M}*){1,20}b/, R.name);
-- Find dictionary matches for all last names
create view StrictLastName1 as
select D.match as lastname
from Dictionary('strictLast.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);

create view StrictLastName3 as
select D.match as lastname
from Dictionary('strictLast_german_bluePages.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match);
--where MatchesRegex(/p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);
create view StrictLastName4 as
select D.match as lastname
from Dictionary('uniqMostCommonSurname.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match);
--where MatchesRegex(/p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);

create view StrictLastName6 as
select D.match as lastname
from Dictionary('names/strictLast_france.dict', Doc.text) D
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);
create view StrictLastName7 as
select D.match as lastname
from Dictionary('names/strictLast_spain.dict', Doc.text) D
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);
create view StrictLastName8 as
select D.match as lastname
from Dictionary('names/strictLast_india.partial.dict', Doc.text) D
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);
create view StrictLastName9 as
select D.match as lastname
from Dictionary('names/strictLast_israel.dict', Doc.text) D
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);
create view StrictLastName as
(select S.lastname as lastname from StrictLastName1 S)
union all
(select S.lastname as lastname from StrictLastName2 S)
union all
(select S.lastname as lastname from StrictLastName3 S)
union all
(select S.lastname as lastname from StrictLastName4 S)
union all
(select S.lastname as lastname from StrictLastName5 S)
union all
(select S.lastname as lastname from StrictLastName6 S)
union all
(select S.lastname as lastname from StrictLastName7 S)
union all
(select S.lastname as lastname from StrictLastName8 S)
union all
(select S.lastname as lastname from StrictLastName9 S);
-- Relaxed version of last name
create view RelaxedLastName1 as
select CombineSpans(SL.lastname, CP.name) as lastname
from StrictLastName SL,
StrictCapsPerson CP
where FollowsTok(SL.lastname, CP.name, 1, 1)
and MatchesRegex(/-/, SpanBetween(SL.lastname, CP.name));
create view RelaxedLastName2 as
select CombineSpans(CP.name, SL.lastname) as lastname
from StrictLastName SL,
StrictCapsPerson CP
where FollowsTok(CP.name, SL.lastname, 1, 1)
and MatchesRegex(/-/, SpanBetween(CP.name, SL.lastname));
-- all the last names
create view LastNameAll as
(select N.lastname as lastname from StrictLastName N)
union all
(select N.lastname as lastname from RelaxedLastName1 N)
union all
(select N.lastname as lastname from RelaxedLastName2 N);

from Dictionary('names/name_israel.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);

from FirstName FN,
InitialWord IW,
CapsPerson CP
where FollowsTok(FN.firstname, IW.word, 0, 0)
and FollowsTok(IW.word, CP.name, 0, 0);

create view NamesAll as
(select P.name as name from NameDict P)
union all
(select P.name as name from NameDict1 P)
union all
(select P.name as name from NameDict2 P)
union all
(select P.name as name from NameDict3 P)
union all
(select P.name as name from NameDict4 P)
union all
(select P.firstname as name from FirstName P)
union all

/**
* Translation for Rule 3r2
*
* This relaxed version of rule '3' will find person names like Thomas B.M. David
* But it only insists that the second word is in the person dictionary
*/
/*
<rule annotation=Person id=3r2>
<internal>
<token attribute={etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token
attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>*/

create view PersonDict as
select C.name as name
--from Consolidate(NamesAll.name) C;
from NamesAll C
consolidate on C.name;

create view Person3r2 as
select CombineSpans(CP.name, LN.lastname) as person
from LastName LN,
InitialWord IW,
CapsPerson CP
where FollowsTok(CP.name, IW.word, 0, 0)
and FollowsTok(IW.word, LN.lastname, 0, 0);

--==========================================================
-- Actual Rules
--==========================================================

/**
* Translation for Rule 4
*
* This rule will find person names like David Thomas
*/
/*
<rule annotation=Person id=4>
<internal>
<token
attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
<token
attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4WithNewLine as
select CombineSpans(FN.firstname, LN.lastname) as person
from FirstName FN,
LastName LN
where FollowsTok(FN.firstname, LN.lastname, 0, 0);

-- For 3-part Person names
create view Person3P1 as
select CombineSpans(F.firstname, L.lastname) as person
from StrictFirstName F,
StrictCapsPersonR S,
StrictLastName L
where FollowsTok(F.firstname, S.name, 0, 0)
--and FollowsTok(S.name, L.lastname, 0, 0)
and FollowsTok(F.firstname, L.lastname, 1, 1)
and Not(Equals(GetText(F.firstname), GetText(L.lastname)))
and Not(Equals(GetText(F.firstname), GetText(S.name)))
and Not(Equals(GetText(S.name), GetText(L.lastname)))
and Not(ContainsRegex(/[nrt]/, SpanBetween(F.firstname, L.lastname)));
create view Person3P2 as
select CombineSpans(P.name, L.lastname) as person
from PersonDict P,
StrictCapsPersonR S,
StrictLastName L
where FollowsTok(P.name, S.name, 0, 0)
--and FollowsTok(S.name, L.lastname, 0, 0)
and FollowsTok(P.name, L.lastname, 1, 1)
and Not(Equals(GetText(P.name), GetText(L.lastname)))
and Not(Equals(GetText(P.name), GetText(S.name)))
and Not(Equals(GetText(S.name), GetText(L.lastname)))
and Not(ContainsRegex(/[nrt]/, SpanBetween(P.name, L.lastname)));

-- Yunyao: 05/20/2008 revised to Person4WrongCandidates for performance reasons
-- NOTE: the current optimizer executes Equals first, which makes Person4Wrong very expensive
--create view Person4Wrong as
--select CombineSpans(FN.firstname, LN.lastname) as person
--from FirstName FN,
--     LastName LN
--where FollowsTok(FN.firstname, LN.lastname, 0, 0)
-- and ContainsRegex(/[\n\r]/, SpanBetween(FN.firstname, LN.lastname))
-- and Equals(GetText(FN.firstname), GetText(LN.lastname));

create view Person3P3 as
select CombineSpans(F.firstname, P.name) as person
from PersonDict P,
StrictCapsPersonR S,
StrictFirstName F
where FollowsTok(F.firstname, S.name, 0, 0)
--and FollowsTok(S.name, P.name, 0, 0)
and FollowsTok(F.firstname, P.name, 1, 1)
and Not(Equals(GetText(P.name), GetText(F.firstname)))
and Not(Equals(GetText(P.name), GetText(S.name)))
and Not(Equals(GetText(S.name), GetText(F.firstname)))
and Not(ContainsRegex(/[nrt]/, SpanBetween(F.firstname, P.name)));

create view Person4WrongCandidates as
select FN.firstname as firstname, LN.lastname as lastname
from FirstName FN,
LastName LN
where FollowsTok(FN.firstname, LN.lastname, 0, 0)
and ContainsRegex(/[\n\r]/, SpanBetween(FN.firstname, LN.lastname));

/**
* Translation for Rule 1
* Handles names of persons like Mr. Vladimir E. Putin
*/
/*
<rule annotation=Person id=1>
<token attribute={etc}INITIAL{etc}>CANYWORD</token>
<internal>
<token attribute={etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}>CAPSPERSON</token>
</internal>
</rule>
*/

SystemT’s Person extractor

create view StrictCapsPersonR as
select R.match as name
--from Regex(/bp{Lu}p{M}*(p{L}p{M}*){1,20}b/, CapsPersonNoP.name) R;
from RegexTok(/p{Lu}p{M}*(p{L}p{M}*){1,20}/, 1, CapsPersonNoP.name) R;

create view StrictLastName5 as
select D.match as lastname
from Dictionary('names/strictLast_italy.dict', Doc.text) D
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);

-- new entries

-- France first name from blue pages
create view StrictFirstName6 as
select D.match as firstname
from Dictionary('names/strictFirst_france.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);

-- Israel first name from blue pages
create view StrictFirstName9 as
select D.match as firstname
from Dictionary('names/strictFirst_israel.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);

'Pro','Bono','Enterprises','Group','Said','Says','Assistant','Vice
'Let', 'Corp', 'Memorial', 'You', 'Your', 'Our', 'My',
','Warden','Contribution',
'His','Her',
'Research', 'Development', 'Product', 'Sales', 'Support',
'Their','Popcorn', 'Name', 'July', 'June','Join',
'Manager', 'Telephone', 'Phone', 'Contact', 'Information',
'Business', 'Administrative', 'South', 'Members',
'Electronics','Managed','West','East','North','South',
'Address', 'Please', 'List',
'Teaches','Ministry', 'Church', 'Association', 'Laboratories',
'Public', 'Inc', 'Parkway',
'Living', 'Community', 'Visiting', 'Brother', 'Buy', 'Then',
'Officer', 'After', 'Pls', 'FYI', 'Only', 'Additionally', 'Adding',
'Services', 'Statements',
'Acquire', 'Addition', 'America', 'Commissioner',
'President', 'Governor',
-- short phrases that are likely to be at the start of a sentence
'Commitment', 'Commits', 'Hey',
'Yes', 'No', 'Ja','End', 'Exit', 'Experiences', 'Finance',
'Director', 'Nein','Kein', 'Keine', 'Gegenstimme',
-- TODO: to be double checked
'Elementary', 'Wednesday', 'At', 'Athletes', 'It', 'Enron',
'Another', 'Anyway','Associate',
'Nov', 'Infrastructure', 'Inside', 'Convention',
'EnronXGate', 'Have', 'However',
'Judge', 'Lady', 'Friday', 'Project',
'Company', 'Companies', 'IBM','Annual', 'Projected',
'Recalls', 'Regards', 'Recently', 'Administration',
-- common verbs appear with person names in financial
reports
'Independence', 'Denied',
-- ideally we want to have a general comprehensive verb list
'Unfortunately', 'Under', 'Uncle', 'Utility', 'Unlike',
to 'W as', a filter dictionary
use as 'Were', 'Secretary',
'Joins', 'Downgrades', 'Upgrades', 'Reports', 'Sees',
'Speaker', 'Chairman', 'Consider', 'Consultant',
'Warns', 'Announces', 'Reviews'
'County', 'Court', 'Defensive',
-- Laura 06/02/2009: new filter dict for title for SEC domain in
'Northwestern',
filterPerson_title.dict 'Place', 'Hi', 'Futures', 'Athlete',
); 'Invitational', 'System',

'International', 'Main', 'Online', 'Ideally'

-- Italy first name from blue pages
create view StrictFirstName5 as
select D.match as firstname
from Dictionary('names/strictFirst_italy.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);

--============================================================
--TODO: need to think through how to deal with hyphenated names
-- one way to do so is to run Regex(pattern, CP.name) and enforce CP.name does not contain '
-- need more testing before confirming the change

create view StrictLastName2 as
select D.match as lastname
from Dictionary('strictLast_german.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match);
--where MatchesRegex(/p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match);

create dictionary GreetingsDict as
-- more entries
(
,'If','Our', 'About', 'Analyst', 'On', 'Of', 'By', 'HR',
'Hey', 'Hi', 'Hello', 'Dear',
'Mkt', 'Pre', 'Post',
-- German greetings 'Ice', 'Surname', 'Lastname',
'Condominium',
'Liebe', 'Lieber', 'Herr', 'Frau', 'Hallo',
'firstname', 'Name', 'familyname',
-- Italian
-- Italian greeting
'Ciao',
'Ciao',
-- Spanish
'Hola',
-- Spanish greeting
-- French
'Hola',
'Bonjour'
-- French greeting
); 'Bonjour',

-- german first name from blue page
create view StrictFirstName4 as
select D.match as firstname
from Dictionary('strictFirst_german_bluePages.dict', Doc.text)
D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/,
D.match);
--where MatchesRegex(/p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);

-- Find strict capitalized words with two letter or more (relaxed version of StrictCapsPerson)

'President', 'Governor', 'Commissioner', 'Commitment',
--include 'core/GenericNE/Person.aql';
'Commits', 'Hey',
'Director', 'End', 'Exit', 'Experiences', 'Finance',
'Elementary', 'Wednesday',
'Nov', 'Infrastructure', 'Inside', 'Convention',
'Judge', 'Lady', 'Friday', 'Project', 'Projected',
create dictionary FilterPersonDict as
'Recalls', 'Regards', 'Recently', 'Administration',
(
'Independence', 'Denied',
'Travel', 'Fellow', 'Sir', 'IBMer', 'Researcher',
'Unfortunately', 'Under', 'Uncle', 'Utility', 'Unlike', 'Was',
'All','Tell',
'Were', 'Secretary',
'Speaker', 'Chairman', 'Consider', 'Consultant', 'County',
'Friends', 'Friend', 'Colleague', 'Colleagues',
'Court', 'Defensive',
'Managers','If',
'Northwestern', 'Place', 'Hi', 'Futures', 'Athlete', 'Invitational',
'Customer', 'Users', 'User', 'Valued', 'Executive',
'System',
'Chairs',
'International', 'Main', 'Online', 'Ideally'
'New', 'Owner', 'Conference', 'Please', 'Outlook',
-- more entries
'Lotus', 'Notes', 'Analyst', 'On', 'Of', 'By', 'HR', 'Mkt', 'Pre',
,'If','Our', 'About',
'This', 'That', 'There', 'Here', 'Subscribers', 'What',
'Post',
'When', 'Where', 'Which',
'Condominium', 'Ice', 'Surname', 'Lastname', 'firstname',
'Name', 'familyname', 'Thanks', 'Thanksgiving','Senator',
'With', 'While',
-- Italian greeting
'Platinum', 'Perspective',
'Ciao',
'Manager', 'Ambassador', 'Professor', 'Dear',
-- Spanish greeting 'Athelet',
'Contact', 'Cheers',
'Hola',
'And', 'Act', 'But', 'Hello', 'Call', 'From', 'Center',
-- French greeting
'The', 'Take', 'Junior',
'Bonjour',
'Both', 'Communities', 'Greetings', 'Hope',
-- new entries

'Restaurants', 'Properties',

-- nick names for US first names
create view StrictFirstName3 as
select D.match as firstname
from Dictionary('strictNickName.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/,
D.match);
--where MatchesRegex(/p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);

-- Indian first name from blue pages
-- TODO: still need to clean up the remaining entries
create view StrictFirstName8 as
select D.match as firstname
from Dictionary('names/strictFirst_india.partial.dict', Doc.text)
D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);

-- German
--include 'core/GenericNE/Person.aql';

'Fon',
'Telefon Geschaeftsstelle',
'Telefon Geschäftsstelle',
create dictionary FilterPersonDict as
'Telefon Zweigstelle',
(
'Telefon Hauptsitz',
'Travel', 'Fellow', 'Sir', 'IBMer', 'Researcher', 'All','Tell',
'Telefon (Geschaeftsstelle)',
'Friends', 'Friend', 'Colleague', 'Colleagues', 'Managers','If',
'Telefon (Geschäftsstelle)',
'Customer', 'Users', 'User', 'Valued', 'Executive', 'Chairs',
'Telefon (Zweigstelle)',
'New', 'Owner', 'Conference', 'Please', 'Outlook', 'Lotus',
'Telefon (Hauptsitz)',
'Notes',
'Telefonnummer',
'This', 'That', 'There', 'Here', 'Subscribers', 'What', 'When',
'Where', 'Which',
'Telefon Geschaeftssitz',
'With', 'While', 'Thanks', 'Thanksgiving','Senator', 'Platinum',
'Telefon Geschäftssitz',
'Perspective', (Geschaeftssitz)',
'Telefon
'Manager', 'Ambassador', 'Professor', 'Dear', 'Contact',
'Telefon (Geschäftssitz)',
'Cheers', 'Athelet',
'Telefon Persönlich',
'And', 'Act', 'But', 'Hello', 'Call', 'From', 'Center', 'The', 'Take',
'Telefon persoenlich',
'Junior',
'Telefon (Persönlich)',
'Both', 'Communities', 'Greetings', 'Hope', 'Restaurants',
'Properties', (persoenlich)',
'Telefon
'Let', 'Corp', 'Memorial', 'You', 'Your', 'Our', 'My', 'His','Her',
'Handy',
'Their','Popcorn', 'Name', 'July', 'June','Join',
'Handy-Nummer',
'Business', 'Administrative', 'South', 'Members', 'Address',
'Telefon arbeit',
'Please', 'List',(arbeit)'
'Telefon
'Public', 'Inc', 'Parkway', 'Brother', 'Buy', 'Then', 'Services',
);
'Statements',

--------------------------------------create view ValidLastNameAll as
select N.lastname as lastname
from LastNameAll N
-- do not allow partially all capitalized words
where Not(MatchesRegex(/(p{Lu}p{M}*)
+-.*([p{Ll}p{Lo}]p{M}*).*/, N.lastname))
and Not(MatchesRegex(/.*([p{Ll}p{Lo}]p{M}*).*(p{Lu}p{M}*)+/, N.lastname));

-- union all the dictionary matches for first names
create view StrictFirstName as
(select S.firstname as firstname from StrictFirstName1 S)
union all
(select S.firstname as firstname from StrictFirstName2 S)
union all
(select S.firstname as firstname from StrictFirstName3 S)
union all
(select S.firstname as firstname from StrictFirstName4 S)
union all
(select S.firstname as firstname from StrictFirstName5 S)
union all
(select S.firstname as firstname from StrictFirstName6 S)
union all
(select S.firstname as firstname from StrictFirstName7 S)
union all
(select S.firstname as firstname from StrictFirstName8 S)
union all
(select S.firstname as firstname from StrictFirstName9 S);

-- Relaxed versions of first name
create view RelaxedFirstName1 as
select CombineSpans(S.firstname, CP.name) as firstname
from StrictFirstName S,
StrictCapsPerson CP
where FollowsTok(S.firstname, CP.name, 1, 1)
and MatchesRegex(/-/, SpanBetween(S.firstname, CP.name));

create view Person1 as
select CombineSpans(CP1.name, CP2.name) as person
from Initial I,
CapsPerson CP1,
InitialWord IW,
CapsPerson CP2
where FollowsTok(I.initial, CP1.name, 0, 0)
and FollowsTok(CP1.name, IW.word, 0, 0)
and FollowsTok(IW.word, CP2.name, 0, 0);
--and Not(ContainsRegex(/[\n\r]/, SpanBetween(I.initial, CP2.name)));

-- all the first names
create view FirstNameAll as
(select N.firstname as firstname from StrictFirstName N)
union all
(select N.firstname as firstname from RelaxedFirstName1 N)
union all
(select N.firstname as firstname from RelaxedFirstName2 N);
create view ValidFirstNameAll as
select N.firstname as firstname
from FirstNameAll N
where Not(MatchesRegex(/(p{Lu}p{M}*)
+-.*([p{Ll}p{Lo}]p{M}*).*/, N.firstname))
and Not(MatchesRegex(/.*([p{Ll}p{Lo}]p{M}*).*(p{Lu}p{M}*)+/, N.firstname));
create view FirstName as
select C.firstname as firstname
--from Consolidate(ValidFirstNameAll.firstname) C;
from ValidFirstNameAll C
consolidate on C.firstname;
-- Combine all dictionary matches for both last names and first
names
create view NameDict as
select D.match as name
from Dictionary('name.dict', Doc.text) D
--where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/,
D.match);
--where MatchesRegex(/p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
create view NameDict1 as
select D.match as name
from Dictionary('names/name_italy.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
create view NameDict2 as
select D.match as name
from Dictionary('names/name_france.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
create view NameDict3 as
select D.match as name
from Dictionary('names/name_spain.dict', Doc.text) D
where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match);
create view NameDict4 as
select D.match as name

-- relaxed version of Rule4a
-- Yunyao: split the following rules into two to improve performance
-- TODO: Test case for optimizer
-- create view Person4ar1 as
-- select CombineSpans(CP.name, FN.firstname) as person
--from FirstName FN,
--     CapsPerson CP
--where FollowsTok(CP.name, FN.firstname, 1, 1)
--and ContainsRegex(/,/,SpanBetween(CP.name, FN.firstname))
--and Not(MatchesRegex(/(.|\n|\r)*(\.|\?|!|'|\sat|\sin)( )*/, LeftContext(CP.name, 10)))
--and Not(MatchesRegex(/(?i)(.+fully)/, CP.name))
--and GreaterThan(GetBegin(CP.name), 10);

/**
* Translation for Rule 1a
* Handles names of persons like Mr. Vladimir Putin
*/
/*
<rule annotation=Person id=1a>
<token attribute={etc}INITIAL{etc}>CANYWORD</token>
<internal>
<token attribute={etc}>CAPSPERSON</token>{1,3}
</internal>
</rule>*/

~250 AQL rules
create view RelaxedFirstName2 as
select CombineSpans(CP.name, S.firstname) as firstname
from StrictFirstName S,
StrictCapsPerson CP
where FollowsTok(CP.name, S.firstname, 1, 1)
and MatchesRegex(/-/, SpanBetween(CP.name, S.firstname));

create view Person4ar1temp as
select FN.firstname as firstname, CP.name as name
from FirstName FN,
CapsPerson CP
where FollowsTok(CP.name, FN.firstname, 1, 1)
and ContainsRegex(/,/,SpanBetween(CP.name, FN.firstname));

-- Split into two rules so that single token annotations are separated from others
-- Single token annotations
create view Person1a1 as
select CP1.name as person
from Initial I,
CapsPerson CP1
where FollowsTok(I.initial, CP1.name, 0, 0)
--- start changing this block
--- disallow newline
and Not(ContainsRegex(/[\n\t]/,SpanBetween(I.initial,CP1.name)))
--- end changing this block
;
-- Yunyao: added 05/09/2008 to match patterns such as "Mr. B. B. Buy"
/*
create view Person1a2 as
select CombineSpans(name.block, CP1.name) as person
from Initial I,
BlockTok(0, 1, 2, InitialWord.word) name,
CapsPerson CP1
where FollowsTok(I.initial, name.block, 0, 0)
and FollowsTok(name.block, CP1.name, 0, 0)
and Not(ContainsRegex(/[\n\t]/,CombineSpans(I.initial, CP1.name)));
*/
create view Person1a as
-- (
select P.person as person from Person1a1 P
-- )
-- union all
-- (select P.person as person from Person1a2 P)
;
/*
create view Person1a_more as
select name.block as person
from Initial I,
BlockTok(0, 2, 3, CapsPerson.name) name
where FollowsTok(I.initial, name.block, 0, 0)
and Not(ContainsRegex(/[\n\t]/,name.block))
--- start changing this block
-- disallow newline
and Not(ContainsRegex(/[\n\t]/,SpanBetween(I.initial,name.block)))
--- end changing this block
;
*/
/**
* Translation for Rule 3
* Find person names like Thomas B.M. David
*/
/*
<rule annotation=Person id=3>
<internal>
<token attribute={etc}PERSON{etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}PERSON{etc}>CAPSPERSON</token>
</internal>
</rule>*/
create view Person3 as
select CombineSpans(P1.name, P2.name) as person
from PersonDict P1,
--InitialWord IW,
WeakInitialWord IW,
PersonDict P2
where FollowsTok(P1.name, IW.word, 0, 0)
and FollowsTok(IW.word, P2.name, 0, 0)
and Not(Equals(GetText(P1.name), GetText(P2.name)));
/**
* Translation for Rule 3r1
*
* This relaxed version of rule '3' will find person names like Thomas B.M. David
* But it only insists that the first word is in the person dictionary
*/
/*
<rule annotation=Person id=3r1>
<internal>
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}>CAPSPERSON</token>
</internal>
</rule>
*/

create view Person4 as
(select P.person as person from Person4WithNewLine P)
minus
(select CombineSpans(P.firstname, P.lastname) as person
from Person4WrongCandidates P
where Equals(GetText(P.firstname), GetText(P.lastname)));
/**
* Translation for Rule4a
* This rule will find person names like Thomas, David
*/
/*
<rule annotation=Person id=4a>
<internal>
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
<token attribute={etc}>,</token>
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4a as
select CombineSpans(LN.lastname, FN.firstname) as person
from FirstName FN,
LastName LN
where FollowsTok(LN.lastname, FN.firstname, 1, 1)
and ContainsRegex(/,/,SpanBetween(LN.lastname,
FN.firstname));
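The Person4a rule above joins a last name and a first name that are separated by exactly one token, requires a comma in the gap, and merges the two spans. The Python sketch below is one hedged reading of that join, not SystemT's implementation: the tokenizer regex, the `(begin, end)` span tuples, and the helper names `follows_tok`, `span_between`, `combine_spans`, and `person4a` are all assumptions that loosely mirror the AQL built-ins.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (begin, end) character offsets, end exclusive

# Assumed tokenizer: a run of word characters, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def follows_tok(text: str, left: Span, right: Span, min_tok: int, max_tok: int) -> bool:
    """True if `right` starts after `left` ends with min_tok..max_tok tokens in between."""
    if right[0] < left[1]:
        return False
    gap_tokens = TOKEN_RE.findall(text[left[1]:right[0]])
    return min_tok <= len(gap_tokens) <= max_tok


def span_between(left: Span, right: Span) -> Span:
    """The gap between the end of `left` and the start of `right`."""
    return (left[1], right[0])


def combine_spans(left: Span, right: Span) -> Span:
    """Smallest span covering both inputs."""
    return (min(left[0], right[0]), max(left[1], right[1]))


def person4a(text: str, last_names: List[Span], first_names: List[Span]) -> List[Span]:
    """Nested-loop version of the 'LastName, FirstName' join."""
    matches = []
    for ln in last_names:
        for fn in first_names:
            if not follows_tok(text, ln, fn, 1, 1):
                continue
            begin, end = span_between(ln, fn)
            if "," in text[begin:end]:
                matches.append(combine_spans(ln, fn))
    return matches


if __name__ == "__main__":
    doc = "Contact Thomas, David today."
    ln = (doc.index("Thomas"), doc.index("Thomas") + len("Thomas"))
    fn = (doc.index("David"), doc.index("David") + len("David"))
    for b, e in person4a(doc, [ln], [fn]):
        print(doc[b:e])
```

A nested loop is the simplest way to express the join. A declarative rule like Person4a leaves the choice of join strategy to the engine, which is the point the tutorial makes about cost-based optimization.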

create view Person4ar1 as
select CombineSpans(P.name, P.firstname) as person
from Person4ar1temp P
where Not(MatchesRegex(/(.|\n|\r)*(\.|\?|!|'|\sat|\sin)( )*/, LeftContext(P.name, 10))) --'
and Not(MatchesRegex(/(?i)(.+fully)/, P.name))
and GreaterThan(GetBegin(P.name), 10);
create view Person4ar2 as
select CombineSpans(LN.lastname, CP.name) as person
from CapsPerson CP,
LastName LN
where FollowsTok(LN.lastname, CP.name, 0, 1)
and ContainsRegex(/,/,SpanBetween(LN.lastname, CP.name));
/**
* Translation for Rule2
*
* This rule handles names of persons like B.M. Thomas David,
* where Thomas occurs in some person dictionary
*/
/*
<rule annotation=Person id=2>
<internal>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}PERSON{etc}>CAPSPERSON</token>
<token attribute={etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person2 as
select CombineSpans(IW.word, CP.name) as person
from InitialWord IW,
PersonDict P,
CapsPerson CP
where FollowsTok(IW.word, P.name, 0, 0)
and FollowsTok(P.name, CP.name, 0, 0);
/**
* Translation for Rule 2a
*
* The rule handles names of persons like B.M. Thomas David,
* where David occurs in some person dictionary
*/
/*
<rule annotation=Person id=2a>
<internal>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}>CAPSPERSON</token>
<token attribute={etc}>NEWLINE</token>?
<token attribute={etc}PERSON{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person2a as
select CombineSpans(IW.word, P.name) as person
from InitialWord IW,
CapsPerson CP,
PersonDict P
where FollowsTok(IW.word, CP.name, 0, 0)
and FollowsTok(CP.name, P.name, 0, 0);

/*

<rule annotation=Person id=4r1>
<internal>
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
<token attribute={etc}>NEWLINE</token>?
<token attribute={etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4r1 as
select CombineSpans(FN.firstname, CP.name) as person
from FirstName FN,
CapsPerson CP
where FollowsTok(FN.firstname, CP.name, 0, 0);
/**
* Translation for Rule 4r2
*
* This relaxed version of rule '4' will find person names like Thomas, David
* But it only insists that the SECOND word is in some person dictionary
*/
/*
<rule annotation=Person id=4r2>
<token attribute={etc}>ANYWORD</token>
<internal>
<token attribute={etc}>CAPSPERSON</token>
<token attribute={etc}>NEWLINE</token>?
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4r2 as
select CombineSpans(CP.name, LN.lastname) as person
from CapsPerson CP,
LastName LN
where FollowsTok(CP.name, LN.lastname, 0, 0);
/**
* Translation for Rule 5
*
* This rule will find other single token person first names
*/
/*
<rule annotation=Person id=5>
<internal>
<token attribute={etc}>INITIALWORD</token>?
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person5 as
select CombineSpans(IW.word, FN.firstname) as person
from InitialWord IW,
FirstName FN
where FollowsTok(IW.word, FN.firstname, 0, 0);
/**
* Translation for Rule 6
*
* This rule will find other single token person last names
*/
/*
<rule annotation=Person id=6>
<internal>
<token attribute={etc}>INITIALWORD</token>?
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person6 as
select CombineSpans(IW.word, LN.lastname) as person
from InitialWord IW,
LastName LN
where FollowsTok(IW.word, LN.lastname, 0, 0);
--==========================================================
-- End of rules
--
-- Create final list of names based on all the matches extracted
--==========================================================
/**
* Union all matches found by strong rules, except the ones that
* come directly from dictionary matches
*/
create view PersonStrongWithNewLine as
(select P.person as person from Person1 P)
--union all
-- (select P.person as person from Person1a_more P)
union all
(select P.person as person from Person3 P)
union all
(select P.person as person from Person4 P)
union all
(select P.person as person from Person3P1 P);
create view PersonStrongSingleTokenOnly as
(select P.person as person from Person5 P)
union all
(select P.person as person from Person6 P)
union all
(select P.firstname as person from FirstName P)
union all
(select P.lastname as person from LastName P)
union all
(select P.person as person from Person1a P);
-- Yunyao: added 05/09/2008 to expand person names with suffix
create view PersonStrongSingleTokenOnlyExpanded1 as
select CombineSpans(P.person,S.suffix) as person
from
PersonStrongSingleTokenOnly P,
PersonSuffix S
where
FollowsTok(P.person, S.suffix, 0, 0);
-- Yunyao: added 04/14/2009 to expand single token person name with a single initial
-- extend single token person with a single initial
create view PersonStrongSingleTokenOnlyExpanded2 as
select CombineSpans(R.person, RightContext(R.person,2)) as person
from PersonStrongSingleTokenOnly R
where MatchesRegex(/ +[\p{Upper}]\b\s*/, RightContext(R.person,3));
create view PersonStrongSingleToken as
(select P.person as person from
PersonStrongSingleTokenOnly P)
union all
(select P.person as person from
PersonStrongSingleTokenOnlyExpanded1 P)
union all
(select P.person as person from
PersonStrongSingleTokenOnlyExpanded2 P);
/**
* Union all matches found by weak rules
*/
create view PersonWeak1WithNewLine as
(select P.person as person from Person3r1 P)
union all
(select P.person as person from Person3r2 P)
union all
(select P.person as person from Person4r1 P)
union all
(select P.person as person from Person4r2 P)
union all
(select P.person as person from Person2 P)
union all
(select P.person as person from Person2a P)
union all
(select P.person as person from Person3P2 P)
union all
(select P.person as person from Person3P3 P);
-- weak rules that identify (LastName, FirstName)
create view PersonWeak2WithNewLine as
(select P.person as person from Person4a P)
union all
(select P.person as person from Person4ar1 P)
union all
(select P.person as person from Person4ar2 P);

--include 'core/GenericNE/Person-FilterNewLineSingle.aql';
--include 'core/GenericNE/Person-Filter.aql';

Person

create view PersonBase as
(select P.person as person from PersonStrongWithNewLine
P)
union all
(select P.person as person from PersonWeak1WithNewLine
P)
union all
(select P.person as person from PersonWeak2WithNewLine
P);
output view PersonBase;

“Global financial services firm Morgan Stanley announced …”
create view Person3r1 as

create view ValidLastNameAll as
select N.lastname as lastname

Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges
Enterprise information extraction: recent developments and open challenges

Mais conteúdo relacionado

Mais procurados

Webinar: How MongoDB is Used to Manage Reference Data - May 2014
Webinar: How MongoDB is Used to Manage Reference Data - May 2014Webinar: How MongoDB is Used to Manage Reference Data - May 2014
Webinar: How MongoDB is Used to Manage Reference Data - May 2014MongoDB
 
Harness the power of Big Data
Harness the power of Big DataHarness the power of Big Data
Harness the power of Big Dataarms8586
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data AnalyticsUtkarsh Sharma
 
Architecting a-big-data-platform-for-analytics 24606569
Architecting a-big-data-platform-for-analytics 24606569Architecting a-big-data-platform-for-analytics 24606569
Architecting a-big-data-platform-for-analytics 24606569Kun Le
 
Oracle Corporation
Oracle CorporationOracle Corporation
Oracle CorporationPrakhar Omar
 
Age of Exploration: How to Achieve Enterprise-Wide Discovery
Age of Exploration: How to Achieve Enterprise-Wide DiscoveryAge of Exploration: How to Achieve Enterprise-Wide Discovery
Age of Exploration: How to Achieve Enterprise-Wide DiscoveryInside Analysis
 
Data Warehousing 2016
Data Warehousing 2016Data Warehousing 2016
Data Warehousing 2016Kent Graziano
 
Dell Technology World - IT as a Business - Multi-Cloud Strategy is your Product
Dell Technology World - IT as a Business - Multi-Cloud Strategy is your ProductDell Technology World - IT as a Business - Multi-Cloud Strategy is your Product
Dell Technology World - IT as a Business - Multi-Cloud Strategy is your ProductManuel "Manny" Rodriguez-Perez
 
Webinar: How Banks Manage Reference Data with MongoDB
 Webinar: How Banks Manage Reference Data with MongoDB Webinar: How Banks Manage Reference Data with MongoDB
Webinar: How Banks Manage Reference Data with MongoDBMongoDB
 
GoogleQuoteWSJ.290213955
GoogleQuoteWSJ.290213955GoogleQuoteWSJ.290213955
GoogleQuoteWSJ.290213955ypai
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewGreat Wide Open
 
Power BI Advanced Data Modeling Virtual Workshop
Power BI Advanced Data Modeling Virtual WorkshopPower BI Advanced Data Modeling Virtual Workshop
Power BI Advanced Data Modeling Virtual WorkshopCCG
 
Agile Data Rationalization for Operational Intelligence
Agile Data Rationalization for Operational IntelligenceAgile Data Rationalization for Operational Intelligence
Agile Data Rationalization for Operational IntelligenceInside Analysis
 
Enterprise Search - Introduction
Enterprise Search - IntroductionEnterprise Search - Introduction
Enterprise Search - IntroductionAmplexor
 
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical ApproachSlides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical ApproachDATAVERSITY
 
Data Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data ArchitectureData Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data ArchitectureZaloni
 

Mais procurados (20)

Webinar: How MongoDB is Used to Manage Reference Data - May 2014
Webinar: How MongoDB is Used to Manage Reference Data - May 2014Webinar: How MongoDB is Used to Manage Reference Data - May 2014
Webinar: How MongoDB is Used to Manage Reference Data - May 2014
 
Harness the power of Big Data
Harness the power of Big DataHarness the power of Big Data
Harness the power of Big Data
 
Bi 5
Bi 5Bi 5
Bi 5
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data Analytics
 
Architecting a-big-data-platform-for-analytics 24606569
Architecting a-big-data-platform-for-analytics 24606569Architecting a-big-data-platform-for-analytics 24606569
Architecting a-big-data-platform-for-analytics 24606569
 
Oracle Corporation
Oracle CorporationOracle Corporation
Oracle Corporation
 
Age of Exploration: How to Achieve Enterprise-Wide Discovery
Age of Exploration: How to Achieve Enterprise-Wide DiscoveryAge of Exploration: How to Achieve Enterprise-Wide Discovery
Age of Exploration: How to Achieve Enterprise-Wide Discovery
 
Data Warehousing 2016
Data Warehousing 2016Data Warehousing 2016
Data Warehousing 2016
 
Dell Technology World - IT as a Business - Multi-Cloud Strategy is your Product
Dell Technology World - IT as a Business - Multi-Cloud Strategy is your ProductDell Technology World - IT as a Business - Multi-Cloud Strategy is your Product
Dell Technology World - IT as a Business - Multi-Cloud Strategy is your Product
 
Webinar: How Banks Manage Reference Data with MongoDB
 Webinar: How Banks Manage Reference Data with MongoDB Webinar: How Banks Manage Reference Data with MongoDB
Webinar: How Banks Manage Reference Data with MongoDB
 
GoogleQuoteWSJ.290213955
GoogleQuoteWSJ.290213955GoogleQuoteWSJ.290213955
GoogleQuoteWSJ.290213955
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An Overview
 
Semantic Technology in Publishing & Finance
Semantic Technology in Publishing & FinanceSemantic Technology in Publishing & Finance
Semantic Technology in Publishing & Finance
 
SharePoint Alternatives
SharePoint AlternativesSharePoint Alternatives
SharePoint Alternatives
 
Power BI Advanced Data Modeling Virtual Workshop
Power BI Advanced Data Modeling Virtual WorkshopPower BI Advanced Data Modeling Virtual Workshop
Power BI Advanced Data Modeling Virtual Workshop
 
Agile Data Rationalization for Operational Intelligence
Agile Data Rationalization for Operational IntelligenceAgile Data Rationalization for Operational Intelligence
Agile Data Rationalization for Operational Intelligence
 
Iod 2013 Jackman Schwenger
Iod 2013 Jackman SchwengerIod 2013 Jackman Schwenger
Iod 2013 Jackman Schwenger
 
Enterprise Search - Introduction
Enterprise Search - IntroductionEnterprise Search - Introduction
Enterprise Search - Introduction
 
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical ApproachSlides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
 
Data Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data ArchitectureData Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data Architecture
 

Destaque

Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsBenjamin Habegger
 
Information Extraction from Web-Scale N-Gram Data
Information Extraction from Web-Scale N-Gram DataInformation Extraction from Web-Scale N-Gram Data
Information Extraction from Web-Scale N-Gram DataGerard de Melo
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the WebTommaso Teofili
 
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges
Enterprise Search in the Big Data Era: Recent Developments and Open ChallengesEnterprise Search in the Big Data Era: Recent Developments and Open Challenges
Enterprise Search in the Big Data Era: Recent Developments and Open ChallengesYunyao Li
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersSriTeja Allaparthi
 
Multimodal Information Extraction: Disease, Date and Location Retrieval
Multimodal Information Extraction: Disease, Date and Location RetrievalMultimodal Information Extraction: Disease, Date and Location Retrieval
Multimodal Information Extraction: Disease, Date and Location RetrievalSvitlana volkova
 
Mining Product Synonyms - Slides
Mining Product Synonyms - SlidesMining Product Synonyms - Slides
Mining Product Synonyms - SlidesAnkush Jain
 
Group-13 Project 15 Sub event detection on social media
Group-13 Project 15 Sub event detection on social mediaGroup-13 Project 15 Sub event detection on social media
Group-13 Project 15 Sub event detection on social mediaAhmedali Durga
 
Web Information Extraction Learning based on Probabilistic Graphical Models
Web Information Extraction Learning based on Probabilistic Graphical ModelsWeb Information Extraction Learning based on Probabilistic Graphical Models
Web Information Extraction Learning based on Probabilistic Graphical ModelsGUANBO
 
System for-health-diagnosis
System for-health-diagnosisSystem for-health-diagnosis
System for-health-diagnosisask2372
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Textbutest
 
A survey of_eigenvector_methods_for_web_information_retrieval
A survey of_eigenvector_methods_for_web_information_retrievalA survey of_eigenvector_methods_for_web_information_retrieval
A survey of_eigenvector_methods_for_web_information_retrievalChen Xi
 
Information_retrieval_and_extraction_IIIT
Information_retrieval_and_extraction_IIITInformation_retrieval_and_extraction_IIIT
Information_retrieval_and_extraction_IIITAnkit Sharma
 
Open Information Extraction 2nd
Open Information Extraction 2ndOpen Information Extraction 2nd
Open Information Extraction 2ndhit_alex
 
Information Retrieval and Extraction
Information Retrieval and ExtractionInformation Retrieval and Extraction
Information Retrieval and ExtractionChristopher Frenz
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionDeeksha thakur
 
ATI Courses Professional Development Short Course Remote Sensing Information ...
ATI Courses Professional Development Short Course Remote Sensing Information ...ATI Courses Professional Development Short Course Remote Sensing Information ...
ATI Courses Professional Development Short Course Remote Sensing Information ...Jim Jenkins
 

Destaque (20)

Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Information Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and ToolsInformation Extraction from the Web - Algorithms and Tools
Information Extraction from the Web - Algorithms and Tools
 
Information Extraction from Web-Scale N-Gram Data
Information Extraction from Web-Scale N-Gram DataInformation Extraction from Web-Scale N-Gram Data
Information Extraction from Web-Scale N-Gram Data
 
Data and Information Extraction on the Web
Data and Information Extraction on the WebData and Information Extraction on the Web
Data and Information Extraction on the Web
 
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges
Enterprise Search in the Big Data Era: Recent Developments and Open ChallengesEnterprise Search in the Big Data Era: Recent Developments and Open Challenges
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges
 
IRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research PapersIRE- Algorithm Name Detection in Research Papers
IRE- Algorithm Name Detection in Research Papers
 
Multimodal Information Extraction: Disease, Date and Location Retrieval
Multimodal Information Extraction: Disease, Date and Location RetrievalMultimodal Information Extraction: Disease, Date and Location Retrieval
Multimodal Information Extraction: Disease, Date and Location Retrieval
 
Web Information Retrieval and Mining
Web Information Retrieval and MiningWeb Information Retrieval and Mining
Web Information Retrieval and Mining
 
[EN] Capture Indexing & Auto-Classification | DLM Forum Industry Whitepaper 0...
[EN] Capture Indexing & Auto-Classification | DLM Forum Industry Whitepaper 0...[EN] Capture Indexing & Auto-Classification | DLM Forum Industry Whitepaper 0...
[EN] Capture Indexing & Auto-Classification | DLM Forum Industry Whitepaper 0...
 
Mining Product Synonyms - Slides
Mining Product Synonyms - SlidesMining Product Synonyms - Slides
Mining Product Synonyms - Slides
 
Group-13 Project 15 Sub event detection on social media
Group-13 Project 15 Sub event detection on social mediaGroup-13 Project 15 Sub event detection on social media
Group-13 Project 15 Sub event detection on social media
 
Web Information Extraction Learning based on Probabilistic Graphical Models
Web Information Extraction Learning based on Probabilistic Graphical ModelsWeb Information Extraction Learning based on Probabilistic Graphical Models
Web Information Extraction Learning based on Probabilistic Graphical Models
 
System for-health-diagnosis
System for-health-diagnosisSystem for-health-diagnosis
System for-health-diagnosis
 
Information extraction for Free Text
Information extraction for Free TextInformation extraction for Free Text
Information extraction for Free Text
 
A survey of_eigenvector_methods_for_web_information_retrieval
A survey of_eigenvector_methods_for_web_information_retrievalA survey of_eigenvector_methods_for_web_information_retrieval
A survey of_eigenvector_methods_for_web_information_retrieval
 
Information_retrieval_and_extraction_IIIT
Information_retrieval_and_extraction_IIITInformation_retrieval_and_extraction_IIIT
Information_retrieval_and_extraction_IIIT
 
Open Information Extraction 2nd
Open Information Extraction 2ndOpen Information Extraction 2nd
Open Information Extraction 2nd
 
Information Retrieval and Extraction
Information Retrieval and ExtractionInformation Retrieval and Extraction
Information Retrieval and Extraction
 
Algorithm Name Detection & Extraction
Algorithm Name Detection & ExtractionAlgorithm Name Detection & Extraction
Algorithm Name Detection & Extraction
 
ATI Courses Professional Development Short Course Remote Sensing Information ...
ATI Courses Professional Development Short Course Remote Sensing Information ...ATI Courses Professional Development Short Course Remote Sensing Information ...
ATI Courses Professional Development Short Course Remote Sensing Information ...
 

Semelhante a Enterprise information extraction: recent developments and open challenges

Future of Power: Big Data - Søren Ravn
Future of Power: Big Data - Søren RavnFuture of Power: Big Data - Søren Ravn
Future of Power: Big Data - Søren RavnIBM Danmark
 
Application Consolidation and Retirement
Application Consolidation and RetirementApplication Consolidation and Retirement
Application Consolidation and RetirementIBM Analytics
 
Value proposition for big data isv partners 0714
Value proposition for big data isv partners 0714Value proposition for big data isv partners 0714
Value proposition for big data isv partners 0714Niu Bai
 
IMS10 unleash the capabilities of new technologies
IMS10   unleash the capabilities of new technologiesIMS10   unleash the capabilities of new technologies
IMS10 unleash the capabilities of new technologiesRobert Hain
 
Identity and Biometrics in the Big Data & Analytics Context
Identity and Biometrics in the Big Data & Analytics ContextIdentity and Biometrics in the Big Data & Analytics Context
Identity and Biometrics in the Big Data & Analytics ContextCharles Li
 
Li charles biometrics analytics & big data 122013a for release
Li charles    biometrics analytics & big data 122013a for releaseLi charles    biometrics analytics & big data 122013a for release
Li charles biometrics analytics & big data 122013a for releaseCharles Li
 
Empowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningEmpowering you with Democratized Data Access, Data Science and Machine Learning
Empowering you with Democratized Data Access, Data Science and Machine LearningDataWorks Summit
 
PoT - probeer de mogelijkheden van datamining zelf uit 30-10-2014
Enterprise information extraction: recent developments and open challenges

  • 1. Enterprise Information Extraction SIGMOD 2010 Tutorial Frederick Reiss, Yunyao Li, Laura Chiticariu, and Sriram Raghavan IBM Almaden Research Center © 2009 IBM Corporation
  • 2. Who we are
      • Researchers from the Search and Analytics group at IBM Almaden Research Center
          – Frederick Reiss
          – Yunyao Li
          – Laura Chiticariu
          – Sriram Raghavan (virtual)
      • Working on information extraction since 2006-08
          – SystemT project
          – Code shipping with 8 IBM products
  • 3. Road Map
      • What is Information Extraction? (Fred Reiss) [You are here]
      • Declarative Information Extraction (Fred Reiss)
      • What the Declarative Approach Enables
          – Scalable Infrastructure (Yunyao Li)
          – Development Support (Laura Chiticariu)
      • Conclusion / Q&A (Fred Reiss)
  • 4. Obligatory “What is Information Extraction?” Slide
      • Distill structured data from unstructured and semi-structured text
      • Exploit the extracted data in your applications
      Example text: For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." Richard Stallman, founder of the Free Software Foundation, countered saying…
      Annotations:
          Name               Title     Organization
          Bill Gates         CEO       Microsoft
          Bill Veghte        VP        Microsoft
          Richard Stallman   Founder   Free Soft..
      (from Cohen’s IE tutorial, 2003)
  • 5. SIGMOD 2006 Tutorial [Doan06] in One Slide (Bibliography at the end of the slide deck.)
      • Information extraction has been an area of study in Natural Language Processing and AI for years
      • Core ideas from database research not a part of existing work in this area
          – Declarative languages
          – Well-defined semantics
          – Cost-based optimization
      • The challenge: Can we build a “System R” for information extraction?
      • Survey of early-stage projects attacking this problem
  • 6. What’s new?
      • New enterprise-focused applications…
      • …driving new requirements…
      • …leading to declarative approaches
  • 7. Enterprise Applications of Information Extraction
      • Previous tutorial showed research prototypes
          – Avatar: Semantic search on personal emails
          – DBLife: Use IE to build a knowledge base about database researchers
          – AliBaba: IE over medical research papers
      • Since then, IE has gone mainstream
          – Enterprise Semantic Search
          – Enterprise Data as a Service
          – Business Intelligence
          – Data-driven Enterprise Mashups
  • 8. Enterprise Semantic Search
      • Use information extraction to improve accuracy and presentation of search results
      Figure callouts, Gumshoe (IBM) [Zhu07,Li06]:
          – Extract geographical information
          – Extract acronyms and their meanings
          – Identify pages in different parts of the intranet that are about the same topic
  • 9. Enterprise Data as a Service
      • Extract and clean useful information hidden in publicly available documents
      • Rent the extracted information over the Internet
      Examples: DBLife [1]; Midas (IBM) (Demo today!)
      XML excerpt shown on the slide:
          ...
          <issuer>
            <issuerCik>0000070858</issuerCik>
            <issuerName>BANK OF AMERICA CORP /DE/</issuerName>
            <issuerTradingSymbol>BAC</issuerTradingSymbol>
          </issuer>
          <reportingOwner>
            <reportingOwnerId>
              <rptOwnerCik>0001090355</rptOwnerCik>
              <rptOwnerName>THAIN JOHN A</rptOwnerName>
            </reportingOwnerId>
            <reportingOwnerAddress>
              <rptOwnerStreet1>C/O GOLDMAN SACHS GROUP</rptOwnerStreet1>
              <rptOwnerStreet2>85 BROAD STREET</rptOwnerStreet2>
              <rptOwnerCity>NEW YORK</rptOwnerCity>
              ...
            </reportingOwnerAddress>
            <reportingOwnerRelationship>
              <isOfficer>1</isOfficer>
              <officerTitle>Pres Glbl Bkg Sec &amp; Wlth Mgmt</officerTitle>
            </reportingOwnerRelationship>
          </reportingOwner>
          ...
  • 10. Business Intelligence
      [Diagram: Public Data (social networks, blogs, government data); Enterprise Data (emails, call center records, legacy data); Information Extraction; Data Warehouse; Traditional BI Tools; New BI Tools]
      • Important applications
          – Marketing: Customer sentiment, brand management
          – Legal: Electronic legal discovery, identifying product pipeline problems
          – Strategy: Important economic events, monitoring competitors
  • 11. Business Intelligence (diagram and application list of slide 10 repeated, with a callout: IBM eDiscovery Analyzer)
  • 12. Data-Driven Mashups
      • Extract structured information from unstructured feeds
      • Join extracted information with other structured enterprise data
      Examples: IBM Lotus Notes Live Text; IBM InfoSphere MashupHub [Simmen09]
  • 13. Enterprise Information Extraction
      • IE has become increasingly important to emerging enterprise applications
      • Set of requirements driven by enterprise apps that use information extraction
          – Scalability
              • Large data volumes, often orders of magnitude larger than classical NLP corpora
          – Accuracy
              • Garbage-in garbage-out: Usefulness of application is often tied to quality of extraction
          – Usability
              • Building an accurate IE system is labor-intensive
              • Professional programmers are much more expensive than grad students!
  • 14. A Canonical IE System
      [Diagram: Text → Feature Selection → Features → Entity Identification → Entities and Relationships → Entity Resolution → Structured Information]
  • 15. A Canonical IE System
      [Diagram: Text → Feature Selection → Features → Entity Identification → Entities and Relationships → Entity Resolution → Structured Information]
      • Boundaries between these stages are not clear-cut
      • This diagram shows a simplified logical data flow
          – Traditionally, physical data flow the same as logical
          – But the systems we’ll talk about take a very different approach to the actual order of execution
  • 16. Feature Selection
      • Identify features
          – Very simple, “atomic” entities
          – Inputs for other stages
      • Examples of features
          – Dictionary match
          – Regular expression match
          – Part of speech
      • Typical components used
          – Off-the-shelf morphology package
          – Many simple rules
      • Very time-consuming and underappreciated
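
The feature types listed on slide 16 can be illustrated with a minimal Python sketch. It is not taken from the tutorial or from SystemT: the `Feature` record, the `FIRST_NAMES` dictionary, and the three patterns are assumptions chosen only to show what dictionary-match and regular-expression-match features might look like as character spans.

```python
import re
from typing import List, NamedTuple


class Feature(NamedTuple):
    kind: str   # which extractor produced the span (hypothetical labels)
    begin: int  # character offset, inclusive
    end: int    # character offset, exclusive
    text: str


# Hypothetical dictionary of common first names (illustrative only).
FIRST_NAMES = {"john", "jack", "laura"}

WORD = re.compile(r"[A-Za-z]+")
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")
PHONE = re.compile(r"\b\d{3}-\d{4}\b")


def extract_features(text: str) -> List[Feature]:
    """Return simple "atomic" features as spans over the input text."""
    features = []
    # Dictionary match: a word whose lowercase form is in the dictionary.
    for m in WORD.finditer(text):
        if m.group().lower() in FIRST_NAMES:
            features.append(Feature("FirstNameDict", m.start(), m.end(), m.group()))
    # Regular expression matches: capitalized words and 7-digit phone numbers.
    for kind, pattern in (("CapsWord", CAPITALIZED), ("PhoneRegex", PHONE)):
        for m in pattern.finditer(text):
            features.append(Feature(kind, m.start(), m.end(), m.group()))
    return sorted(features, key=lambda f: (f.begin, f.kind))


features = extract_features("Call John Merker at 555-1212.")
for f in features:
    print(f)
```

Keeping every feature as a `(kind, begin, end)` span means later stages can combine features by position without re-scanning the text.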
  • 17. Entity Identification
      • Use basic features to build more complex features
          – Example: …was done by Mr. Jack Gurbingal at the…
            Dictionary match: Common first name
            + Regular expr match: Capitalized word
            = Complex feature: Potential person name
      • Use other features to determine which of the complex features are instances of entities and relationships
      • Most information extraction research focuses on this stage
          – Variety of different techniques
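
The example on slide 17 (common first name + capitalized word = potential person name) can be sketched as a single hand-written rule. This is an illustration under stated assumptions, not the tutorial's implementation: the `FIRST_NAMES` dictionary and the function name `candidate_person_names` are hypothetical, and the adjacency test (only whitespace between the two tokens) is one simple choice among many.

```python
import re

# Hypothetical dictionary of common first names (illustrative only).
FIRST_NAMES = {"jack", "john"}

TOKEN = re.compile(r"[A-Za-z]+")
CAPITALIZED = re.compile(r"[A-Z][a-z]+")


def candidate_person_names(text):
    """Combine two basic features into a complex one.

    A dictionary match (common first name) immediately followed by a
    regular expression match (capitalized word) yields a candidate
    person name, returned as (begin, end, text).
    """
    tokens = [(m.start(), m.end(), m.group()) for m in TOKEN.finditer(text)]
    candidates = []
    for (b1, e1, t1), (b2, e2, t2) in zip(tokens, tokens[1:]):
        is_first_name = t1.lower() in FIRST_NAMES
        is_capitalized = CAPITALIZED.fullmatch(t2) is not None
        adjacent = text[e1:b2].isspace()  # only whitespace between the tokens
        if is_first_name and is_capitalized and adjacent:
            candidates.append((b1, e2, text[b1:e2]))
    return candidates


names = candidate_person_names("...was done by Mr. Jack Gurbingal at the...")
print(names)
```

The output is only a candidate: as the slide notes, other features (for example the preceding "Mr.") would be needed to decide whether the span really is a person.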
  • 18. Entity Resolution  Perform complex analyses over entities and relationships  Examples – Identify entities that refer to the same person or thing – Join extracted information with external structured data  Not the main focus of this tutorial – But interacts with other parts of information extraction
  • 19. Obligatory Person-Phone Example: “Call John Merker at 555-1212. John also has a cell #: 555-1234”
  • 22. Person-Phone Example: Entities and Relationships [Diagram: canonical pipeline with Entity Identification highlighted; the text “Call John Merker at 555-1212. John also has a cell #: 555-1234” is annotated with Person, Phone, and NumType spans]
  • 23. Person-Phone Example: Entities and Relationships [Diagram: canonical pipeline with Entity Resolution highlighted; the same annotated text, with the two Person mentions linked as “Same Person” and the result labeled “Join with office phone directory”]
  • 24. Road Map  What is Information Extraction?  Declarative Information Extraction [You are here]  What the Declarative Approach Enables – Scalable Infrastructure (Yunyao Li) – Development Support (Laura Chiticariu)  Conclusion / Q&A (Fred Reiss)
  • 25. Declarative Information Extraction  Overview of traditional approaches to information extraction  Practical issues with applying traditional approaches  How recent work has used declarative approaches to address these issues  Different types of declarative approaches
  • 26. Traditional Approaches to Information Extraction  Two dominant types: – Rule-Based – Machine Learning-Based  Distinction is based on how Entity Identification is performed [Diagram: Text → Feature Selection → Features → Entity Identification → Entities and Relationships → Entity Resolution → Structured Information]
  • 27. Anatomy of a Rule-Based System [Diagram: Example Documents inform the Feature Selection Rules and Entity Identification Rules, which drive the Feature Selection and Entity Identification stages of the canonical pipeline]
  • 28. Anatomy of a Machine Learning-Based System [Diagram: Example Documents inform the Feature Selection Rules; Labeled Documents pass through Feature Selection to yield Features and Labels, which Training turns into a Model; the Model drives the Entity Identification stage of the canonical pipeline]
  • 29. A Brief History of IE in the NLP Community  Rule-Based – 1978-1997: MUC (Message Understanding Conference) • DARPA competition 1987 to 1997 • FRUMP [DeJong82] • FASTUS [Appelt93] • TextPro, PROTEUS – 1998: Common Pattern Specification Language (CPSL) standard [Appelt98] • Standard for subsequent rule-based systems – 1999-2010: Commercial products, GATE  Machine Learning – At first: Simple techniques like Naive Bayes – 1990’s: Learning Rules • AUTOSLOG [Riloff93] • CRYSTAL [Soderland98] • SRV [Freitag98] – 2000’s: More specialized models • Hidden Markov Models [Leek97] • Maximum Entropy Markov Models [McCallum00] • Conditional Random Fields [Lafferty01] • Automatic feature expansion  For further reading: Sunita Sarawagi’s Survey [Sarawagi08], Claire Cardie’s Survey [Cardie97]
  • 30. Tying the System Together: Traditional IE Frameworks  Traditional approach: Workflow system – Sequence of discrete steps – Data only flows forward  GATE¹ and UIMA² are the most popular frameworks – Type systems and standard data formats  Web services and Hadoop also in common use – No standard data format [Figure: Workflow for the ANNIE system [Cunningham09]] ¹ GATE (General Architecture for Text Engineering) official web site: http://gate.ac.uk/ ² Apache UIMA (Unstructured Information Management Architecture) official web site: http://uima.apache.org/
  • 31. Sequential Execution in CPSL Rules [Figure: a cascade of grammar levels applied to a filler-text document] – Level 0 (Feature Selection): the text is marked up as … <FirstName> <CapsWord> at <Digits>-<Digits> … – Level 1: 〈FirstName〉 〈CapsWord〉 → 〈Person〉 and 〈Digits〉 〈Token〉[~ “-”] 〈Digits〉 → 〈Phone〉, giving … <Person> at <Phone> … – Level 2: 〈Person〉 〈Token〉[~ “at”] 〈Phone〉 → 〈PersonPhone〉, giving … <PersonPhone> …
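The cascade on this slide can be imitated with a small Python sketch in which each rule is one left-to-right pass that rewrites a sequence of typed items. This is only an illustration of sequential execution, not an implementation of CPSL; the helper names (`apply_rule`, `is_type`, `is_token`) and the hand-written Level 0 output are assumptions made for the example.

```python
def apply_rule(seq, pattern, label, sep=" "):
    """One grammar pass: scan left to right and replace each match of
    `pattern` (a list of predicates over (type, text) items) by one item."""
    out, i = [], 0
    while i < len(seq):
        window = seq[i:i + len(pattern)]
        if len(window) == len(pattern) and all(p(x) for p, x in zip(pattern, window)):
            out.append((label, sep.join(text for _, text in window)))
            i += len(pattern)
        else:
            out.append(seq[i])
            i += 1
    return out

def is_type(t):
    return lambda item: item[0] == t

def is_token(text):
    return lambda item: item[0] == "Token" and item[1] == text

# Level 0 (feature selection) output, written by hand for this sketch.
level0 = [("Token", "Call"), ("FirstName", "John"), ("CapsWord", "Merker"),
          ("Token", "at"), ("Digits", "555"), ("Token", "-"), ("Digits", "1212")]

# Level 1: build Person and Phone from the basic features.
level1 = apply_rule(level0, [is_type("FirstName"), is_type("CapsWord")], "Person")
level1 = apply_rule(level1, [is_type("Digits"), is_token("-"), is_type("Digits")],
                    "Phone", sep="")

# Level 2: build PersonPhone from the Level 1 output.
level2 = apply_rule(level1, [is_type("Person"), is_token("at"), is_type("Phone")],
                    "PersonPhone")
```

Each level consumes the items it matches, so a later level only ever sees the output of the earlier ones. That is the order dependence the following slides criticize.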
  • 32. Problems with Traditional IE Approaches  Complex, fixed pipelines and rule sets  Semantics tied to order of execution – Scalability: Data only flows forward, leading to wasted work in early stages. – Accuracy: Lots of custom procedural code. – Usability: Hard to understand why the system produces a particular result.
  • 33. Declarative to the Rescue!  Define the logical constraints between rules/components  System determines order of execution – Scalability: Optimizer avoids wasted work – Accuracy: More expressive rule languages; combine different tools easily – Usability: Describe what to extract, instead of how to extract it
  • 34. What do we mean by “declarative”?  Common vision: – Separate semantics from order of execution – Build the system around a language like SQL or Datalog  Different systems have different interpretations  Three main categories – High-Level Declarative • Most common approach – Completely Declarative – Mixed Declarative
  • 35. High-Level Declarative  Replace the overall IE framework with a declarative language  Each individual extraction component is still a “black box”  Example 1: SQoUT [Jain08] [Diagram: SQL query and Catalog of Extraction Modules → Optimizer → Query plan; the query plan combines extraction modules with scan and index access to data]
  • 37. High-Level Declarative  Replace the overall IE framework with a declarative language  Each individual extraction component is still a “black box”  Example 1: SQoUT [Jain08]  Example 2: PSOX [Bohannon08]  Advantages: – Allows use of many existing “black box” packages – High-level performance optimizations possible – Clear semantics for using different packages for the same task  Drawbacks: – Doesn’t address issues that occur within a given “black box” – Limited opportunities for optimization, unless “black boxes” can provide hints
  • 38. Completely Declarative  One declarative language covers all stages of extraction  Example 1: AQL language in SystemT [Chiticariu10]
      -- Find all matches
      -- of a dictionary
      create view Name as
      extract dictionary CommonFirstName
        on D.text as name
      from Document D;
      -- Match people with their
      -- phone numbers
      create view PersonPhone as
      select P.name as person, N.num as phone
      from Person P, PhoneNum N
      where …
      -- Find pairs of references
      -- to the same person
      create view SamePerson as
      select P1.name as name1, P2.name as name2
      from Person P1, Person P2
      where …
    [Diagram: the three views are shown alongside the Feature Selection, Entity Identification, and Entity Resolution stages of the canonical pipeline]
  • 40. Declarative Semantics Example: Identifying Musician-Instrument Relationships  Rules: – (pipe | guitar | hammond organ | …) → Instrument – (Person Annotator) → Person – 〈Person〉 〈0-5 tokens〉 〈Instrument〉 → PersonPlaysInstrument  Input: “John Pipe plays the guitar” [Diagram: “Pipe” can be read as part of the Person “John Pipe” or as an Instrument. Sequential readings shown: 〈Person〉 〈Person〉 〈Token〉 〈Token〉 〈Instrument〉 and 〈Person〉 〈Instrument〉 〈Token〉 〈Token〉 〈Instrument〉. The declarative reading keeps the overlapping Person and Instrument annotations side by side]
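One way to picture the declarative reading is to keep every annotation as an independent span and evaluate the relationship rule as a join over those spans. The Python sketch below assumes token-offset spans, a single-token instrument dictionary, and a hard-coded stand-in for the Person annotator; none of these details come from the slide.

```python
tokens = "John Pipe plays the guitar".split()

# Hypothetical single-token instrument dictionary.
INSTRUMENTS = {"pipe", "guitar"}

# Every dictionary hit becomes an Instrument span (begin, end), even when it
# overlaps another annotation.
instrument = [(i, i + 1) for i, tok in enumerate(tokens) if tok.lower() in INSTRUMENTS]

# Stand-in for the output of a Person annotator: "John Pipe".
person = [(0, 2)]

# Rule: <Person> <0-5 tokens> <Instrument> -> PersonPlaysInstrument,
# evaluated as a join over spans rather than a pass over a token stream.
plays = [(p, ins) for p in person for ins in instrument
         if 0 <= ins[0] - p[1] <= 5]
```

Because "Pipe" is never consumed by one rule at the expense of another, the result does not depend on which annotator happened to run first.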
  • 42. Completely Declarative  One declarative language covers all stages of extraction  Example 1: AQL language in SystemT [Chiticariu10]  Example 2: Conditional Random Fields in SQL [Wang10]  Advantages: – Unified language → clear semantics from top to bottom – Optimizer has full control over low-level operations – Can incorporate existing packages using user-defined functions (UDFs)  Drawbacks: – Code inside UDFs doesn’t benefit from declarativeness
  • 43. Mixed Declarative  Language provides declarativeness at the level of some, but not all, of the extraction operations, both at the individual and pipeline level  Example: Xlog (CIMPLE) [Shen07] [Figure: Extraction program for talk extracts, from [1]. Callouts: “This Datalog predicate represents a large, opaque block of extraction code.” and “This predicate is defined in Datalog, using low-level operations.”]
  • 44. Mixed Declarative  Language provides declarativeness at the level of some, but not all, of the extraction operations, both at the individual and pipeline level  Example: Xlog (CIMPLE) [Shen08]  Advantages: – Ability to reuse existing “black box” packages – Optimizer gets some flexibility to reorder low-level operations  Drawbacks: – Challenging to build an optimizer that does both “high-level” and “low-level” optimizations
  • 45. Declarative to the Rescue!  Different notions of declarativeness in different systems  All kinds address the major issues in enterprise IE, but in different ways – Scalability: Optimizer avoids wasted work – Accuracy: More expressive rule languages; combine different tools easily – Usability: Describe what to extract, instead of how to extract it
  • 46. Road Map  What is Information Extraction? (Fred Reiss)  Declarative Information Extraction (Fred Reiss)  What the Declarative Approach Enables – Scalable Infrastructure (Yunyao Li) [You are here] – Development Support (Laura Chiticariu)  Conclusion/Questions
  • 47. Scalable Infrastructure: Yunyao Li, IBM Almaden Research Center
  • 49. Conventional vs. Declarative IE Infrastructure  Conventional: – Operational semantics and implementation are hard-coded and interconnected  Declarative: – Separate semantics from implementation. – Database-style design: Optimizer + Runtime [Diagram: conventional: Extraction Pipeline → Runtime Environment; declarative: Declarative Language → Optimizer → Plan → Runtime Environment]
  • 50. Why Declarative IE for Scalability  An informal experimental study [Reiss08] – Collection of 4.5 million web logs – Band Review Annotator: identify informal reviews of concerts [Chart: CPSL-based implementation vs. declarative implementation, labeled “20x faster”]
  • 51. Different Aspects of Design for Scalability  Optimization – Granularity • High-level: annotator composition • Low-level: basic extraction operators – Strategy: • Rewrite-based • Cost-based  Runtime Model – Document-Centric vs. Collection-Centric
  • 52. Optimization Granularity for Declarative IE  Annotator Composition – Each annotator extracts one or more entities or relationships • E.g. Person annotator – Black box assumption on how an annotator works – Optimizing composition of extraction pipeline  Basic Extraction Operator – Each operator represents an atomic extraction operation • E.g. dictionary matching, regular expression, join,… – System is fully aware of how each extraction operator works – Optimizing each basic extraction operator [Labels along the bottom of the slide: High-level declarative, Mixed declarative, Completely declarative]
  • 53. Optimization Strategies for Declarative IE  Rewrite-based – Applying rewrite rules to transform the declarative form of the annotators into an equivalent form that is more efficient  Cost-Based – Enumerating all possible physical execution plans, estimating their cost, and choosing the one with the minimum expected cost  Systems may mix these two approaches
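The cost-based strategy can be sketched as "enumerate, estimate, pick the minimum". The Python below does this for the order in which three filtering operators run; the operator names, per-item costs, and selectivities are invented numbers, and the cost formula is a textbook simplification rather than the cost model of any system covered here.

```python
from itertools import permutations

# Hypothetical (cost per input item, fraction of items that survive).
OPS = {
    "regex": (5.0, 0.10),
    "dictionary": (1.0, 0.50),
    "pos_tagger": (20.0, 0.80),
}

def plan_cost(order):
    """Expected cost per input item when the operators run in `order` and
    each operator only sees what the previous ones let through."""
    cost, surviving = 0.0, 1.0
    for name in order:
        op_cost, selectivity = OPS[name]
        cost += surviving * op_cost
        surviving *= selectivity
    return cost

# Enumerate every plan, estimate its cost, keep the cheapest.
best_plan = min(permutations(OPS), key=plan_cost)
best_cost = plan_cost(best_plan)
```

Exhaustive enumeration is only workable for tiny plan spaces; a real optimizer would prune the search, for example with dynamic programming.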
  • 54. Runtime Model for Declarative IE  Document-Centric [Diagram: Input Document Stream → Runtime Environment → Annotated Document Stream]  Collection-Centric [Diagram: Document Collection and Auxiliary index → Runtime Environment → Annotations]
  • 55. Systems  CIMPLE  RAD  SQoUT  SystemT  BayesStore
  • 56. Cimple  Rewrite-based optimization [Shen07] – Inverted-index based simple pattern matching • Shared document scan – Simple patterns: P1 = “(Jeff|Jeffery)\s\s*Ullman”, P2 = “(Jeff|Jeffery)\s\s*Naughton”, P3 = “Laura\s\s*Haas”, P4 = “Peter\s\s*Haas” [Diagram: parse trees (p1) to (p4) built from AND and OR nodes over the pattern terms, and an inverted index from terms to the patterns that contain them, e.g. Ullman → P1, Naughton → P2, Haas → P3, P4]
  • 57. Cimple  Pushing down text properties [Shen07] – E.g. to find an all-capitalized line [Diagram: Plan a applies σallcaps(x) over lines(d,x,n); Plan b adds σcontainCaps(d) below lines(d,x,n)]  Scoping [Shen07] – Imposing location conditions on where to extract spans • E.g. check for names only within two lines of the occurrence of titles  Incorporating a cost model to decide how to apply the rewrite.
  • 58. Cimple  Collection-centric runtime model – Document collection (or snapshots of document collection) – Previous extraction results  Reusing previous extraction results [Chen08][Chen09] • Similar to maintaining materialized views • Cyclex: IE program viewed as one big blackbox [Chen08] • Delex: IE program viewed as a workflow of blackboxes [Chen09]
  • 59. RAD [Khaitan09]  Query language: a declarative subset of CPSL specification – Regular expressions over features and existing annotations [Diagram: offline process: Document Collection → tokenization and sentence chunking → generating indexed features → Document Inverted index. Indexed features: • Dictionary lookup (e.g. First name) • Part of speech lookup (e.g. Noun, verb) • Regular expression on tokens (e.g. CapsWord, Alphanum). Query time: Query → Optimizer → generating derived entities over the index using a series of join operators (e.g. Person, Organization) → Document Inverted index + Annotations]
  • 60. RAD  Cost-based Optimization based on Posting-list Statistics • E.g. ANYWORD@ANYWORD.com for Email [Diagram: Plan a and Plan b evaluate the pattern with zig-zag joins over the inverted index, combining the posting lists for ANYWORD, @, ANYWORD, ., and com in different orders, with intermediate results R1 to R3 in Plan a and R1 to R4 in Plan b]
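For readers unfamiliar with zig-zag joins, the following Python sketch intersects two sorted posting lists by letting each side skip ahead to the other side's current value. It works on plain document IDs and ignores the positional adjacency that a pattern such as ANYWORD@ANYWORD.com needs, so it shows the general technique only, not RAD's operator.

```python
from bisect import bisect_left

def zigzag_intersect(a, b):
    """Intersect two sorted posting lists. Whichever side is behind jumps
    forward with a binary search instead of stepping one entry at a time."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = bisect_left(a, b[j], i + 1)   # skip a forward to >= b[j]
        else:
            j = bisect_left(b, a[i], j + 1)   # skip b forward to >= a[i]
    return out

common = zigzag_intersect([2, 5, 9, 14, 30], [1, 5, 6, 14, 15, 31])
```

The amount of skipping depends on how selective each list is, which is why posting-list statistics are a natural input for choosing between plans.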
  • 61. RAD  Rewrite-based Optimization – Share sub-expression evaluation • Evaluate the same sub-expression only once
  • 66. SQoUT [Ipeirotis07][Jain07,08,09]  Focus on composition of extraction systems [Diagram: SQL Query → Entities/relations to extract → Extraction System Repository (Extraction Systems E0 … Em, each paired with a Retrieval Strategy) over the Document Collection → Extraction results → Data Cleaning → Extracted View]
  • 67. SQoUT  Cost-based Query Optimization  New Plan Enumeration Strategies – Document retrieval strategies • E.g. filtered scan: running the annotator only over potentially relevant docs – Join execution • Independent join, outer/inner join, zig-zag join: extraction results of one relation can determine the docs retrieved for another relation.  Efficiency vs. Quality Cost Model [Diagram labels: Goodness, Quality, Efficiency, Weight]
  • 68. SystemT [Reiss08] [Krishnamurthy08] [Chiticariu10] [Diagram: Rules → Preprocessor → Blocks → Planner (Plan Enumerator, Cost Model) → Block Plans → Postprocessor → Final Plan]  Preprocessor • Divide rules into compilation blocks. • Rewrite-based optimization within each block.  Planner • System R style cost-based optimization within each block.  Postprocessor • Merge block plans into a single operator graph. • Rewrite-based optimization across blocks.
  • 69. Example: Restricted Span Evaluation (RSE)  Leverage the sequential nature of text – Join predicates on character or token distance  Only evaluate the inner operand on the relevant portions of the document  Limited applicability – Need to guarantee exact same results  Only look for dictionary matches in the vicinity of a phone number. [Diagram: for the text “…John Smith at 555-1212…”, the Regex operator finds 555-1212, the Dictionary operator finds John Smith, and RSEJoin combines them into “John Smith at 555-1212”]
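The idea of evaluating the inner side of a join only near matches of the outer side can be sketched in Python as follows. The phone regex, the one-entry name dictionary, and the 30-character window are assumptions chosen for the example; this is not SystemT's RSEJoin operator.

```python
import re

PHONE = re.compile(r"\b\d{3}-\d{4}\b")
CAPS_WORD = re.compile(r"\b[A-Z][a-z]+\b")
FIRST_NAMES = {"John"}   # hypothetical dictionary
WINDOW = 30              # characters to the left of each phone number

def rse_join(text):
    """Find phone numbers first, then look for dictionary matches only in a
    small window before each one instead of scanning the whole document."""
    pairs = []
    for phone in PHONE.finditer(text):
        lo = max(0, phone.start() - WINDOW)
        region = text[lo:phone.start()]   # the only text the dictionary sees
        for word in CAPS_WORD.finditer(region):
            if word.group() in FIRST_NAMES:
                pairs.append((word.group(), phone.group()))
    return pairs

doc = ("Lorem ipsum. " * 5 + "John Smith at 555-1212. "
       + "Lorem ipsum. " * 5 + "John left.")
pairs = rse_join(doc)
```

The second "John" in `doc` is never examined, because no phone number occurs near it. The join predicate (name shortly before phone) is what makes restricting the dictionary scan safe.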
  • 70. Example: Shared Dictionary Matching (SDM)  Rewrite-based optimization – Applied to the algebraic plan during postprocessing  Evaluate multiple dictionaries in a single pass [Diagram: before the rewrite, separate Dict operators for D1 and D2 each feed a subplan; after the rewrite, a single SDM Dictionary Operator (SDMDict) evaluates D1 and D2 and feeds both subplans]
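A minimal way to picture single-pass evaluation of several dictionaries is to merge them into one lookup table that remembers which dictionary each entry came from. The Python sketch below handles single-token entries only and uses invented dictionary names; it illustrates the rewrite's effect, not the SDM operator itself.

```python
def shared_dictionary_match(tokens, dictionaries):
    """Match several dictionaries in one pass over the tokens."""
    # Merge all dictionaries: entry -> names of the dictionaries containing it.
    merged = {}
    for name, entries in dictionaries.items():
        for entry in entries:
            merged.setdefault(entry.lower(), set()).add(name)

    matches = {name: [] for name in dictionaries}
    for pos, tok in enumerate(tokens):            # the single shared pass
        for name in merged.get(tok.lower(), ()):
            matches[name].append((pos, tok))
    return matches

result = shared_dictionary_match(
    "John plays guitar in San Jose".split(),
    {"FirstName": {"john", "jose"}, "Instrument": {"guitar"}},
)
```

Each consumer still receives only its own dictionary's matches, so the subplans downstream do not need to change.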
  • 71. SystemT  Document-centric Runtime Model: – One document at a time – Entities extracted are associated with their source document [Diagram: Input Document Stream → Runtime Environment → Annotated Document Stream]  Why one document at a time?
  • 72. Scaling SystemT: From Laptop to Cluster [Diagram: In Lotus Notes Live Text: Lotus Notes Client → Email Message → Input Adapter → SystemT Runtime → Output Adapter → Display Annotated Email. In Cognos Toro Text Analytics: Documents → Hadoop Map-Reduce on a Hadoop Cluster, where the Jaql Runtime calls Jaql Function Wrappers, each wrapping Input Adapter → SystemT Runtime → Output Adapter]
  • 73. BayesStore [Wang10]  Probabilistic declarative IE – In-database machine learning for efficiency and scalability  Text Data and Conditional Random Fields (CRF) Model [Figure: a document stored as a Token table; the CRF model stored as a Factor table]
  • 74. BayesStore  Viterbi Inference SQL Implementation – Implementing dynamic programming algorithm using recursive queries  Rewrite-based optimization.
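The dynamic program that the recursive SQL queries implement can be sketched in a few lines of Python. For simplicity this version uses HMM-style start, transition, and emission probabilities with invented numbers; a CRF would replace them with factor scores, but the recurrence over positions has the same shape. It is not BayesStore's SQL implementation.

```python
def viterbi(obs, states, start, trans, emit):
    """Return the most likely label sequence for the observations `obs`."""
    # best[t][s] = (score of the best path ending in state s at step t,
    #               previous state on that path)
    best = [{s: (start[s] * emit[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        row = {}
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p][0] * trans[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            row[s] = (score * emit[s][obs[t]], prev)
        best.append(row)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]

# Toy model with made-up numbers: label each token as PER (person) or O (other).
states = ["PER", "O"]
start = {"PER": 0.3, "O": 0.7}
trans = {"PER": {"PER": 0.4, "O": 0.6}, "O": {"PER": 0.2, "O": 0.8}}
emit = {"PER": {"john": 0.8, "plays": 0.1, "guitar": 0.1},
        "O": {"john": 0.1, "plays": 0.5, "guitar": 0.4}}
labels = viterbi(["john", "plays", "guitar"], states, start, trans, emit)
```

Each row of `best` depends only on the previous row, which is the property that lets the computation be phrased as a recursive query over a token table.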
  • 75. Summary [Table: design choices of the systems Cimple, RAD, SQoUT, BayesStore, and SystemT. Columns: Optimization Granularity (Basic operator, Annotator composition), Optimization Strategy (Rewrite-based, Cost-based), Runtime Model (Document level, Collection level)]
  • 76. Road Map  What is Information Extraction? (Fred Reiss)  Declarative Information Extraction (Fred Reiss)  What the Declarative Approach Enables – Scalable Infrastructure (Yunyao Li) – Development Support (Laura Chiticariu) [You are here]
  • 77. Development Support (Tooling): Laura Chiticariu, IBM Almaden Research Center
  • 79. A Canonical IE System [Diagram: Text → Feature Selection → Features → Entity Identification → Entities and Relationships → Entity Resolution → Structured Information]  Developing IE systems is an extremely time-consuming, error-prone process
  • 80. The Life Cycle of an IE System [Diagram: Development (Developer): Develop (1. Features 2. Rules / labeled data), Test, Analyze, Refine. Usage / Maintenance (User): Use, Test]
  • 81. Example 1: Explaining Extraction Results [Slide shows overlapping listings of a large AQL person-name and phone-number annotator, including the views Doc, InitialWord1, InitialWord2, InitialWord, WeakInitialWord, LastName, StrictFirstName1, StrictFirstName2, StrictFirstName7, PersonSuffix, CapsPersonCandidate, CapsPerson, CapsPersonNoP, and StrictCapsPerson, and the dictionaries StrongPhoneVariantDictionary, InitialDict, PersonSuffixDict, and GreetingsDict]
matches for all last names create view StrictLastName1 as select D.match as lastname from Dictionary('strictLast.dict', Doc.text) D --where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match); create view StrictLastName3 as select D.match as lastname from Dictionary('strictLast_german_bluePages.dict', Doc.text) D --where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match); --where MatchesRegex(/p{Upper}.{1,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match); create view StrictLastName4 as select D.match as lastname from Dictionary('uniqMostCommonSurname.dict', Doc.text) D --where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match); --where MatchesRegex(/p{Upper}.{1,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match); create view StrictLastName6 as select D.match as lastname from Dictionary('names/strictLast_france.dict', Doc.text) D where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match); create view StrictLastName7 as select D.match as lastname from Dictionary('names/strictLast_spain.dict', Doc.text) D where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match); create view StrictLastName8 as select D.match as lastname from Dictionary('names/strictLast_india.partial.dict', Doc.text) D where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match); create view StrictLastName9 as select D.match as lastname from Dictionary('names/strictLast_israel.dict', Doc.text) D where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match); create view StrictLastName as (select S.lastname as lastname from StrictLastName1 S) union all (select S.lastname as lastname from StrictLastName2 S) union all (select S.lastname as lastname from StrictLastName3 S) union all (select S.lastname as lastname from StrictLastName4 
S) union all (select S.lastname as lastname from StrictLastName5 S) union all (select S.lastname as lastname from StrictLastName6 S) union all (select S.lastname as lastname from StrictLastName7 S) union all (select S.lastname as lastname from StrictLastName8 S) union all (select S.lastname as lastname from StrictLastName9 S); -- Relaxed version of last name create view RelaxedLastName1 as select CombineSpans(SL.lastname, CP.name) as lastname from StrictLastName SL, StrictCapsPerson CP where FollowsTok(SL.lastname, CP.name, 1, 1) and MatchesRegex(/-/, SpanBetween(SL.lastname, CP.name)); create view RelaxedLastName2 as select CombineSpans(CP.name, SL.lastname) as lastname from StrictLastName SL, StrictCapsPerson CP where FollowsTok(CP.name, SL.lastname, 1, 1) and MatchesRegex(/-/, SpanBetween(CP.name, SL.lastname)); -- all the last names create view LastNameAll as (select N.lastname as lastname from StrictLastName N) union all (select N.lastname as lastname from RelaxedLastName1 N) union all (select N.lastname as lastname from RelaxedLastName2 N); from Dictionary('names/name_israel.dict', Doc.text) D where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match); from FirstName FN, InitialWord IW, CapsPerson CP where FollowsTok(FN.firstname, IW.word, 0, 0) and FollowsTok(IW.word, CP.name, 0, 0); create view NamesAll as (select P.name as name from NameDict P) union all (select P.name as name from NameDict1 P) union all (select P.name as name from NameDict2 P) union all (select P.name as name from NameDict3 P) union all (select P.name as name from NameDict4 P) union all (select P.firstname as name from FirstName P) union all /** * Translation for Rule 3r2 * * This relaxed version of rule '3' will find person names like Thomas B.M . 
David * But it only insists that the second word is in the person dictionary */ /* <rule annotation=Person id=3r2> <internal> <token attribute={etc}>CAPSPERSON</token> <token attribute={etc}>INITIALWORD</token> <token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token> </internal> </rule>*/ create view PersonDict as select C.name as name --from Consolidate(NamesAll.name) C; from NamesAll C consolidate on C.name; create view Person3r2 as select CombineSpans(CP.name, LN.lastname) as person from LastName LN, InitialWord IW, CapsPerson CP where FollowsTok(CP.name, IW.word, 0, 0) and FollowsTok(IW.word, LN.lastname, 0, 0); --========================================================== -- Actual Rules --========================================================== /** * Translation for Rule 4 * * This rule will find person names like David Thomas */ /* <rule annotation=Person id=4> <internal> <token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token> <token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token> </internal> </rule> */ create view Person4WithNewLine as select CombineSpans(FN.firstname, LN.lastname) as person from FirstName FN, LastName LN where FollowsTok(FN.firstname, LN.lastname, 0, 0); -- For 3-part Person names create view Person3P1 as select CombineSpans(F.firstname, L.lastname) as person from StrictFirstName F, StrictCapsPersonR S, StrictLastName L where FollowsTok(F.firstname, S.name, 0, 0) --and FollowsTok(S.name, L.lastname, 0, 0) and FollowsTok(F.firstname, L.lastname, 1, 1) and Not(Equals(GetText(F.firstname), GetText(L.lastname))) and Not(Equals(GetText(F.firstname), GetText(S.name))) and Not(Equals(GetText(S.name), GetText(L.lastname))) and Not(ContainsRegex(/[nrt]/, SpanBetween(F.firstname, L.lastname))); create view Person3P2 as select CombineSpans(P.name, L.lastname) as person from PersonDict P, StrictCapsPersonR S, StrictLastName L where FollowsTok(P.name, S.name, 0, 0) --and FollowsTok(S.name, L.lastname, 0, 0) and FollowsTok(P.name, 
L.lastname, 1, 1) and Not(Equals(GetText(P.name), GetText(L.lastname))) and Not(Equals(GetText(P.name), GetText(S.name))) and Not(Equals(GetText(S.name), GetText(L.lastname))) and Not(ContainsRegex(/[nrt]/, SpanBetween(P.name, L.lastname))); -- Yunyao: 05/20/2008 revised to Person4WrongCandidates due to performance reason -- NOTE: current optimizer execute Equals first thus make Person4Wrong very expensive --create view Person4Wrong as --select CombineSpans(FN.firstname, LN.lastname) as person --from FirstName FN, -LastName LN --where FollowsTok(FN.firstname, LN.lastname, 0, 0) -- and ContainsRegex(/[nr]/, SpanBetween(FN.firstname, LN.lastname)) -- and Equals(GetText(FN.firstname), GetText(LN.lastname)); create view Person3P3 as select CombineSpans(F.firstname, P.name) as person from PersonDict P, StrictCapsPersonR S, StrictFirstName F where FollowsTok(F.firstname, S.name, 0, 0) --and FollowsTok(S.name, P.name, 0, 0) and FollowsTok(F.firstname, P.name, 1, 1) and Not(Equals(GetText(P.name), GetText(F.firstname))) and Not(Equals(GetText(P.name), GetText(S.name))) and Not(Equals(GetText(S.name), GetText(F.firstname))) and Not(ContainsRegex(/[nrt]/, SpanBetween(F.firstname, P.name))); create view Person4WrongCandidates as select FN.firstname as firstname, LN.lastname as lastname from FirstName FN, LastName LN where FollowsTok(FN.firstname, LN.lastname, 0, 0) and ContainsRegex(/[nr]/, SpanBetween(FN.firstname, LN.lastname)); /** * Translation for Rule 1 * Handles names of persons like Mr. Vladimir E. 
Putin */ /* <rule annotation=Person id=1> <token attribute={etc}INITIAL{etc}>CANYWORD</token> <internal> <token attribute={etc}>CAPSPERSON</token> <token attribute={etc}>INITIALW ORD</token> <token attribute={etc}>CAPSPERSON</token> </internal> </rule> */ SystemT’s Person extractor SystemT’s Person extractor create view StrictCapsPersonR as select R.match as name --from Regex(/bp{Lu}p{M}*(p{L}p{M}*){1,20}b/, CapsPersonNoP.name) R; from RegexTok(/p{Lu}p{M}*(p{L}p{M}*){1,20}/, 1, CapsPersonNoP.name) R; create view StrictLastName5 as select D.match as lastname from Dictionary('names/strictLast_italy.dict', Doc.text) D where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match); -- new entries -- France first name from blue pages create view StrictFirstName6 as select D.match as firstname from Dictionary('names/strictFirst_france.dict', Doc.text) D where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match); -- Israel first name from blue pages create view StrictFirstName9 as select D.match as firstname from Dictionary('names/strictFirst_israel.dict', Doc.text) D where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match); 'Pro','Bono','Enterprises','Group','Said','Says','Assistant','Vice 'Let', 'Corp', 'Memorial', 'You', 'Your', 'Our', 'My', ','Warden','Contribution', 'His','Her', 'Research', 'Development', 'Product', 'Sales', 'Support', 'Their','Popcorn', 'Name', 'July', 'June','Join', 'Manager', 'Telephone', 'Phone', 'Contact', 'Information', 'Business', 'Administrative', 'South', 'Members', 'Electronics','Managed','West','East','North','South', 'Address', 'Please', 'List', 'Teaches','Ministry', 'Church', 'Association', 'Laboratories', 'Public', 'Inc', 'Parkway', 'Living', 'Community', 'Visiting', 'Brother', 'Buy', 'Then', 'Officer', 'After', 'Pls', 'FYI', 'Only', 'Additionally', 'Adding', 'Services', 'Statements', 'Acquire', 'Addition', 'America', 'Commissioner', 'President', 'Governor', -- short phrases that are likely to be at the start of a sentence 'Commitment', 'Commits', 
'Hey', 'Yes', 'No', 'Ja','End', 'Exit', 'Experiences', 'Finance', 'Director', 'Nein','Kein', 'Keine', 'Gegenstimme', -- TODO: to be double checked 'Elementary', 'W ednesday', 'At', 'Athletes', 'It', 'Enron', 'Another', 'Anyway','Associate', 'Nov', 'Infrastructure', 'Inside', 'Convention', 'EnronXGate', 'Have', 'However', 'Judge', 'Lady', 'Friday', 'Project', 'Company', 'Companies', 'IBM','Annual', 'Projected', 'Recalls', 'Regards', 'Recently', 'Administration', -- common verbs appear with person names in financial reports 'Independence', 'Denied', -- ideally we want to have a general comprehensive verb list 'Unfortunately', 'Under', 'Uncle', 'Utility', 'Unlike', to 'W as', a filter dictionary use as 'Were', 'Secretary', 'Joins', 'Downgrades', 'Upgrades', 'Reports', 'Sees', 'Speaker', 'Chairman', 'Consider', 'Consultant', 'Warns', 'Announces', 'Reviews' 'County', 'Court', 'Defensive', -- Laura 06/02/2009: new filter dict for title for SEC domain in 'Northwestern', filterPerson_title.dict 'Place', 'Hi', 'Futures', 'Athlete', ); 'Invitational', 'System', 'International', 'Main', 'Online', 'Ideally' -- Italy first name from blue pages create view StrictFirstName5 as select D.match as firstname from Dictionary('names/strictFirst_italy.dict', Doc.text) D where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match); --============================================================ --TODO: need to think through how to deal with hypened name -- one way to do so is to run Regex(pattern, CP.name) and enforce CP.name does not contain ' -- need more testing before confirming the change create view StrictLastName2 as select D.match as lastname from Dictionary('strictLast_german.dict', Doc.text) D --where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match); --where MatchesRegex(/p{Upper}.{1,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/((p{L}p{M}*)+s+)?p{Lu}p{M}*.{1,20}/, D.match); create dictionary GreetingsDict as -- more entries ( ,'If','Our', 'About', 
'Analyst', 'On', 'Of', 'By', 'HR', 'Hey', 'Hi', 'Hello', 'Dear', 'Mkt', 'Pre', 'Post', -- German greetings 'Ice', 'Surname', 'Lastname', 'Condominium', 'Liebe', 'Lieber', 'Herr', 'Frau', 'Hallo', 'firstname', 'Name', 'familyname', -- Italian -- Italian greeting 'Ciao', 'Ciao', -- Spanish 'Hola', -- Spanish greeting -- French 'Hola', 'Bonjour' -- French greeting ); 'Bonjour', -- german first name from blue page create view StrictFirstName4 as select D.match as firstname from Dictionary('strictFirst_german_bluePages.dict', Doc.text) D --where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match); --where MatchesRegex(/p{Upper}.{1,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match); -- Find strict capitalized words with two letter or more (relaxed version of StrictCapsPerson) 'President', 'Governor', 'Commissioner', 'Commitment', --include 'core/GenericNE/Person.aql'; 'Commits', 'Hey', 'Director', 'End', 'Exit', 'Experiences', 'Finance', 'Elementary', 'Wednesday', 'Nov', 'Infrastructure', 'Inside', 'Convention', 'Judge', 'Lady', 'Friday', 'Project', 'Projected', create dictionary FilterPersonDict as 'Recalls', 'Regards', 'Recently', 'Administration', ( 'Independence', 'Denied', 'Travel', 'Fellow', 'Sir', 'IBMer', 'Researcher', 'Unfortunately', 'Under', 'Uncle', 'Utility', 'Unlike', 'Was', 'All','Tell', 'Were', 'Secretary', 'Speaker', 'Chairman', 'Consider', 'Consultant', 'County', 'Friends', 'Friend', 'Colleague', 'Colleagues', 'Court', 'Defensive', 'Managers','If', 'Northwestern', 'Place', 'Hi', 'Futures', 'Athlete', 'Invitational', 'Customer', 'Users', 'User', 'Valued', 'Executive', 'System', 'Chairs', 'International', 'Main', 'Online', 'Ideally' 'New', 'Owner', 'Conference', 'Please', 'Outlook', -- more entries 'Lotus', 'Notes', 'Analyst', 'On', 'Of', 'By', 'HR', 'Mkt', 'Pre', ,'If','Our', 'About', 'This', 'That', 'There', 'Here', 'Subscribers', 'W hat', 'Post', 'W hen', 'Where', 'Which', 'Condominium', 'Ice', 
'Surname', 'Lastname', 'firstname', 'Name', 'familyname', 'Thanks', 'Thanksgiving','Senator', 'W ith', 'While', -- Italian greeting 'Platinum', 'Perspective', 'Ciao', 'Manager', 'Ambassador', 'Professor', 'Dear', -- Spanish greeting 'Athelet', 'Contact', 'Cheers', 'Hola', 'And', 'Act', 'But', 'Hello', 'Call', 'From', 'Center', -- French greeting 'The', 'Take', 'Junior', 'Bonjour', 'Both', 'Communities', 'Greetings', 'Hope', -- new entries 'Restaurants', 'Properties', -- nick names for US first names create view StrictFirstName3 as select D.match as firstname from Dictionary('strictNickName.dict', Doc.text) D --where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match); --where MatchesRegex(/p{Upper}.{1,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match); -- Indian first name from blue pages -- TODO: still need to clean up the remaining entries create view StrictFirstName8 as select D.match as firstname from Dictionary('names/strictFirst_india.partial.dict', Doc.text) D where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match); -- German --include 'core/GenericNE/Person.aql'; 'Fon', 'Telefon Geschaeftsstelle', 'Telefon Geschäftsstelle', create dictionary FilterPersonDict as 'Telefon Zweigstelle', ( 'Telefon Hauptsitz', 'Travel', 'Fellow', 'Sir', 'IBMer', 'Researcher', 'All','Tell', 'Telefon (Geschaeftsstelle)', 'Friends', 'Friend', 'Colleague', 'Colleagues', 'Managers','If', 'Telefon (Geschäftsstelle)', 'Customer', 'Users', 'User', 'Valued', 'Executive', 'Chairs', 'Telefon (Zweigstelle)', 'New', 'Owner', 'Conference', 'Please', 'Outlook', 'Lotus', 'Telefon (Hauptsitz)', 'Notes', 'Telefonnummer', 'This', 'That', 'There', 'Here', 'Subscribers', 'What', 'When', 'Where', 'Which', 'Telefon Geschaeftssitz', 'With', 'While', 'Thanks', 'Thanksgiving','Senator', 'Platinum', 'Telefon Geschäftssitz', 'Perspective', (Geschaeftssitz)', 'Telefon 'Manager', 'Ambassador', 'Professor', 'Dear', 'Contact', 'Telefon (Geschäftssitz)', 
'Cheers', 'Athelet', 'Telefon Persönlich', 'And', 'Act', 'But', 'Hello', 'Call', 'From', 'Center', 'The', 'Take', 'Telefon persoenlich', 'Junior', 'Telefon (Persönlich)', 'Both', 'Communities', 'Greetings', 'Hope', 'Restaurants', 'Properties', (persoenlich)', 'Telefon 'Let', 'Corp', 'Memorial', 'You', 'Your', 'Our', 'My', 'His','Her', 'Handy', 'Their','Popcorn', 'Name', 'July', 'June','Join', 'Handy-Nummer', 'Business', 'Administrative', 'South', 'Members', 'Address', 'Telefon arbeit', 'Please', 'List',(arbeit)' 'Telefon 'Public', 'Inc', 'Parkway', 'Brother', 'Buy', 'Then', 'Services', ); 'Statements', --------------------------------------create view ValidLastNameAll as select N.lastname as lastname from LastNameAll N -- do not allow partially all capitalized words where Not(MatchesRegex(/(p{Lu}p{M}*) +-.*([p{Ll}p{Lo}]p{M}*).*/, N.lastname)) and Not(MatchesRegex(/.*([p{Ll}p{Lo}]p{M}*).*(p{Lu}p{M}*)+/, N.lastname)); -- union all the dictionary matches for first names create view StrictFirstName as (select S.firstname as firstname from StrictFirstName1 S) union all (select S.firstname as firstname from StrictFirstName2 S) union all (select S.firstname as firstname from StrictFirstName3 S) union all (select S.firstname as firstname from StrictFirstName4 S) union all (select S.firstname as firstname from StrictFirstName5 S) union all (select S.firstname as firstname from StrictFirstName6 S) union all (select S.firstname as firstname from StrictFirstName7 S) union all (select S.firstname as firstname from StrictFirstName8 S) union all (select S.firstname as firstname from StrictFirstName9 S); -- Relaxed versions of first name create view RelaxedFirstName1 as select CombineSpans(S.firstname, CP.name) as firstname from StrictFirstName S, StrictCapsPerson CP where FollowsTok(S.firstname, CP.name, 1, 1) and MatchesRegex(/-/, SpanBetween(S.firstname, CP.name)); create view Person1 as select CombineSpans(CP1.name, CP2.name) as person from Initial I, CapsPerson CP1, 
InitialWord IW , CapsPerson CP2 where FollowsTok(I.initial, CP1.name, 0, 0) and FollowsTok(CP1.name, IW.word, 0, 0) and FollowsTok(IW .word, CP2.name, 0, 0); --and Not(ContainsRegex(/[nr]/, SpanBetween(I.initial, CP2.name))); -- all the first names create view FirstNameAll as (select N.firstname as firstname from StrictFirstName N) union all (select N.firstname as firstname from RelaxedFirstName1 N) union all (select N.firstname as firstname from RelaxedFirstName2 N); create view ValidFirstNameAll as select N.firstname as firstname from FirstNameAll N where Not(MatchesRegex(/(p{Lu}p{M}*) +-.*([p{Ll}p{Lo}]p{M}*).*/, N.firstname)) and Not(MatchesRegex(/.*([p{Ll}p{Lo}]p{M}*).*(p{Lu}p{M}*)+/, N.firstname)); create view FirstName as select C.firstname as firstname --from Consolidate(ValidFirstNameAll.firstname) C; from ValidFirstNameAll C consolidate on C.firstname; -- Combine all dictionary matches for both last names and first names create view NameDict as select D.match as name from Dictionary('name.dict', Doc.text) D --where MatchesRegex(/p{Upper}p{Lower}[p{Alpha}]{0,20}/, D.match); --where MatchesRegex(/p{Upper}.{1,20}/, D.match); -- changed to enable unicode match where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match); create view NameDict1 as select D.match as name from Dictionary('names/name_italy.dict', Doc.text) D where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match); create view NameDict2 as select D.match as name from Dictionary('names/name_france.dict', Doc.text) D where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match); create view NameDict3 as select D.match as name from Dictionary('names/name_spain.dict', Doc.text) D where MatchesRegex(/p{Lu}p{M}*.{1,20}/, D.match); create view NameDict4 as select D.match as name -- relaxed version of Rule4a -- Yunyao: split the following rules into two to improve performance -- TODO: Test case for optimizer -- create view Person4ar1 as -- select CombineSpans(CP.name, FN.firstname) as person --from FirstName FN, -CapsPerson CP --where 
FollowsTok(CP.name, FN.firstname, 1, 1) --and ContainsRegex(/,/,SpanBetween(CP.name, FN.firstname)) --and Not(M atchesRegex(/(.|n|r)*(.|?|!|'|sat|sin)( )*/, LeftContext(CP.name, 10))) --and Not(M atchesRegex(/(?i)(.+fully)/, CP.name)) --and GreaterThan(GetBegin(CP.name), 10); /** * Translation for Rule 1a * Handles names of persons like Mr. Vladimir Putin */ /* <rule annotation=Person id=1a> <token attribute={etc}INITIAL{etc}>CANYWORD</token> <internal> <token attribute={etc}>CAPSPERSON</token>{1,3} </internal> </rule>*/ ~250 AQL rules ~250 AQL rules create view RelaxedFirstName2 as select CombineSpans(CP.name, S.firstname) as firstname from StrictFirstName S, StrictCapsPerson CP where FollowsTok(CP.name, S.firstname, 1, 1) and MatchesRegex(/-/, SpanBetween(CP.name, S.firstname)); create view Person4ar1temp as select FN.firstname as firstname, CP.name as name from FirstName FN, CapsPerson CP where FollowsTok(CP.name, FN.firstname, 1, 1) and ContainsRegex(/,/,SpanBetween(CP.name, FN.firstname)); -- Split into two rules so that single token annotations are serperated from others -- Single token annotations create view Person1a1 as select CP1.name as person from Initial I, CapsPerson CP1 where FollowsTok(I.initial, CP1.name, 0, 0) --- start changing this block --- disallow allow newline and Not(ContainsRegex(/[nt]/,SpanBetween(I.initial,CP1.name))) --- end changing this block ; -- Yunyao: added 05/09/2008 to match patterns such as "Mr. B. B. 
Buy" /* create view Person1a2 as select CombineSpans(name.block, CP1.name) as person from Initial I, BlockTok(0, 1, 2, InitialW ord.word) name, CapsPerson CP1 where FollowsTok(I.initial, name.block, 0, 0) and FollowsTok(name.block, CP1.name, 0, 0) and Not(ContainsRegex(/[nt]/,CombineSpans(I.initial, CP1.name))); */ create view Person1a as -- ( select P.person as person from Person1a1 P -- ) -- union all -- (select P.person as person from Person1a2 P) ; /* create view Person1a_more as select name.block as person from Initial I, BlockTok(0, 2, 3, CapsPerson.name) name where FollowsTok(I.initial, name.block, 0, 0) and Not(ContainsRegex(/[nt]/,name.block)) --- start changing this block -- disallow newline and Not(ContainsRegex(/[nt]/,SpanBetween(I.initial,name.block))) --- end changing this block ; */ /** * Translation for Rule 3 * Find person names like Thomas B.M. David */ /* <rule annotation=Person id=3> <internal> <token attribute={etc}PERSON{etc}>CAPSPERSON</token> <token attribute={etc}>INITIALW ORD</token> <token attribute={etc}PERSON{etc}>CAPSPERSON</token> </internal> </rule>*/ create view Person3 as select CombineSpans(P1.name, P2.name) as person from PersonDict P1, --InitialW ord IW, WeakInitialWord IW , PersonDict P2 where FollowsTok(P1.name, IW .word, 0, 0) and FollowsTok(IW .word, P2.name, 0, 0) and Not(Equals(GetText(P1.name), GetText(P2.name))); /** * Translation for Rule 3r1 * * This relaxed version of rule '3' will find person names like Thomas B.M. 
David * But it only insists that the first word is in the person dictionary */ /* <rule annotation=Person id=3r1> <internal> <token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token> <token attribute={etc}>INITIALW ORD</token> <token attribute={etc}>CAPSPERSON</token> </internal> </rule> */ create view Person4 as (select P.person as person from Person4WithNewLine P) minus (select CombineSpans(P.firstname, P.lastname) as person from Person4WrongCandidates P where Equals(GetText(P.firstname), GetText(P.lastname))); /** * Translation for Rule4a * This rule will find person names like Thomas, David */ /* <rule annotation=Person id=4a> <internal> <token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token> <token attribute={etc}>,</token> <token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token> </internal> </rule> */ create view Person4a as select CombineSpans(LN.lastname, FN.firstname) as person from FirstName FN, LastName LN where FollowsTok(LN.lastname, FN.firstname, 1, 1) and ContainsRegex(/,/,SpanBetween(LN.lastname, FN.firstname)); create view Person4ar1 as select CombineSpans(P.name, P.firstname) as person from Person4ar1temp P where Not(MatchesRegex(/(.|n|r)*(.|?|!|'|sat|sin)( )*/, LeftContext(P.name, 10))) --' and Not(MatchesRegex(/(?i)(.+fully)/, P.name)) and GreaterThan(GetBegin(P.name), 10); create view Person4ar2 as select CombineSpans(LN.lastname, CP.name) as person from CapsPerson CP, LastName LN where FollowsTok(LN.lastname, CP.name, 0, 1) and ContainsRegex(/,/,SpanBetween(LN.lastname, CP.name)); /** * Translation for Rule2 * * This rule will handles names of persons like B.M . 
Thomas David, where Thomas occurs in some person dictionary */ /* <rule annotation=Person id=2> <internal> <token attribute={etc}>INITIALWORD</token> <token attribute={etc}PERSON{etc}>CAPSPERSON</token> <token attribute={etc}>CAPSPERSON</token> </internal> </rule> */ create view Person2 as select CombineSpans(IW.word, CP.name) as person from InitialWord IW, PersonDict P, CapsPerson CP where FollowsTok(IW.word, P.name, 0, 0) and FollowsTok(P.name, CP.name, 0, 0); /** * Translation for Rule 2a * * The rule handles names of persons like B.M . Thomas David, where David occurs in some person dictionary */ /* <rule annotation=Person id=2a> <internal> <token attribute={etc}>INITIALWORD</token> <token attribute={etc}>CAPSPERSON</token> <token attribute={etc}>NEWLINE</token>? <token attribute={etc}PERSON{etc}>CAPSPERSON</token> </internal> </rule> */ create view Person2a as select CombineSpans(IW.word, P.name) as person from InitialWord IW, CapsPerson CP, PersonDict P where FollowsTok(IW.word, CP.name, 0, 0) and FollowsTok(CP.name, P.name, 0, 0); /* <rule annotation=Person id=4r1> <internal> <token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</toke n> <token attribute={etc}>NEWLINE</token>? <token attribute={etc}>CAPSPERSON</token> </internal> </rule> */ create view Person4r1 as select CombineSpans(FN.firstname, CP.name) as person from FirstName FN, CapsPerson CP where FollowsTok(FN.firstname, CP.name, 0, 0); /** * Translation for Rule 4r2 * * This relaxed version of rule '4' will find person names Thomas, David * But it only insists that the SECOND word is in some person dictionary */ /* <rule annotation=Person id=4r2> <token attribute={etc}>ANYWORD</token> <internal> <token attribute={etc}>CAPSPERSON</token> <token attribute={etc}>NEWLINE</token>? 
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</toke n> </internal> </rule> */ create view Person4r2 as select CombineSpans(CP.name, LN.lastname) as person from CapsPerson CP, LastName LN where FollowsTok(CP.name, LN.lastname, 0, 0); /** * Translation for Rule 5 * * This rule will find other single token person first names */ /* <rule annotation=Person id=5> <internal> <token attribute={etc}>INITIALWORD</token>? <token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</toke n> </internal> </rule> */ create view Person5 as select CombineSpans(IW.word, FN.firstname) as person from InitialWord IW, FirstName FN where FollowsTok(IW.word, FN.firstname, 0, 0); /** * Translation for Rule 6 * * This rule will find other single token person last names */ /* <rule annotation=Person id=6> <internal> <token attribute={etc}>INITIALWORD</token>? <token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</toke n> </internal> </rule> */ create view Person6 as select CombineSpans(IW.word, LN.lastname) as person from InitialWord IW, LastName LN where FollowsTok(IW.word, LN.lastname, 0, 0); -================================================= ========= -- End of rules --- Create final list of names based on all the matches extracted --================================================= ========= /** * Union all matches found by strong rules, except the ones directly come * from dictionary matches */ create view PersonStrongWithNewLine as (select P.person as person from Person1 P) --union all -- (select P.person as person from Person1a_more P) union all (select P.person as person from Person3 P) union all (select P.person as person from Person4 P) union all (select P.person as person from Person3P1 P); create view PersonStrongSingleTokenOnly as (select P.person as person from Person5 P) union all (select P.person as person from Person6 P) union all (select P.firstname as person from FirstName P) union all (select P.lastname as person from LastName P) union all (select P.person as person 
from Person1a P); -- Yunyao: added 05/09/2008 to expand person names with suffix create view PersonStrongSingleTokenOnlyExpanded1 as select CombineSpans(P.person,S.suffix) as person from PersonStrongSingleTokenOnly P, PersonSuffix S where FollowsTok(P.person, S.suffix, 0, 0); -- Yunyao: added 04/14/2009 to expand single token person name with a single initial -- extend single token person with a single initial create view PersonStrongSingleTokenOnlyExpanded2 as select CombineSpans(R.person, RightContext(R.person,2)) as person from PersonStrongSingleTokenOnly R where MatchesRegex(/ +[p{Upper}]bs*/, RightContext(R.person,3)); create view PersonStrongSingleToken as (select P.person as person from PersonStrongSingleTokenOnly P) union all (select P.person as person from PersonStrongSingleTokenOnlyExpanded1 P) union all (select P.person as person from PersonStrongSingleTokenOnlyExpanded2 P); /** * Union all matches found by weak rules */ create view PersonWeak1WithNewLine as (select P.person as person from Person3r1 P) union all (select P.person as person from Person3r2 P) union all (select P.person as person from Person4r1 P) union all (select P.person as person from Person4r2 P) union all (select P.person as person from Person2 P) union all (select P.person as person from Person2a P) union all (select P.person as person from Person3P2 P) union all (select P.person as person from Person3P3 P); -- weak rules that identify (LastName, FirstName) create view PersonWeak2WithNewLine as (select P.person as person from Person4a P) union all (select P.person as person from Person4ar1 P) union all (select P.person as person from Person4ar2 P); --include 'core/GenericNE/Person-FilterNewLineSingle.aql'; --include 'core/GenericNE/Person-Filter.aql'; Person create view PersonBase as (select P.person as person from PersonStrongWithNewLine P) union all (select P.person as person from PersonWeak1WithNewLine P) union all (select P.person as person from PersonWeak2WithNewLine P); output 
view PersonBase; “Global financial services firm Morgan Stanley announced … ““ “Global financial services firm Morgan Stanley announced … create view Person3r1 as create view ValidLastNameAll as select N.lastname as lastname © 2009 IBM Corporation

Editor's Notes

  1. To update the collection-centric model, add an auxiliary index and an annotation store
  2. Each extraction result is stored with its source document and its associated positions in the document
  3. Basically: convert the JAPE rule into a relational calculus expression => a big self-join over a table of <word, position> pairs. Generate an efficient join plan using (inverted) index access when possible. Some parts still require going back to the document; we want these high in the operator graph.
  4. At a high level, the optimization strategy is very similar to the one in System R, but with a novel access method, novel join algorithms, and a 2-dimensional cost model.
  5. The document-centric model enables embedding SystemT in a wide variety of applications. For instance, in Lotus Notes, when a user opens an email, the message is sent to the SystemT runtime at the same time, and the runtime generates annotations on the fly. When the email is displayed to the user, the newly generated annotations are displayed as well. SystemT can also be embedded as a Map job in a map-reduce framework, which allows the system to scale up and process large volumes of documents.
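
The self-join mentioned in note 3 can be illustrated with a small Python sketch. This is not SystemT or JAPE code. The function names, the whitespace tokenizer, and the sample documents are assumptions made up for illustration. It treats a two-token sequence rule ("a word from one dictionary immediately followed by a word from another") as a join over <word, position> pairs, and uses an inverted index so that only postings for dictionary words are read. Each match keeps its document id and token positions, in the spirit of note 2.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of (doc_id, position) pairs where it occurs.

    Tokenization is a plain whitespace split (an assumption for this sketch).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split()):
            index[word.lower()].add((doc_id, pos))
    return index


def match_adjacent(index, first_words, second_words):
    """Self-join on the <word, position> table.

    Returns sorted (doc_id, start_pos, end_pos) triples where a word from
    first_words is immediately followed by a word from second_words.
    """
    # Index access: collect postings for the right-hand dictionary only.
    right_postings = set()
    for word in second_words:
        right_postings |= index.get(word.lower(), set())

    matches = []
    for word in first_words:
        for doc_id, pos in index.get(word.lower(), set()):
            # Join condition: same document, adjacent positions.
            if (doc_id, pos + 1) in right_postings:
                matches.append((doc_id, pos, pos + 1))
    return sorted(matches)


# Hypothetical sample data.
docs = {
    "d1": "yesterday bill gates met bill veghte",
    "d2": "the gates were open",
}
index = build_inverted_index(docs)
print(match_adjacent(index, ["bill"], ["gates", "veghte"]))
# prints [('d1', 1, 2), ('d1', 4, 5)]
```

The join never scans document text: it touches only the postings of the dictionary words, which is the kind of index access the note refers to. Predicates that need the original text (for example a regular expression over a span) would still require going back to the document.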