Rule-based Information Extraction is DEAD
Long Live Rule-based Information Extraction Systems!
Laura Chiticariu, Yunyao Li, Frederick Reiss
IBM Research - Almaden

THE DISCONNECT: ACADEMIC vs. INDUSTRY

[Figure: Entity Extraction Papers by Year. Y-axis: Fraction of NLP Papers (0% to 100%); X-axis: Year of Publication. Series: Rule-Based, Hybrid, Machine Learning Based.]

[Figure: Implementations of Entity Extraction]

                              Rule-Based   Hybrid   Machine Learning Based
NLP Papers (2003-2012)           3.5%       21%            75%
Commercial Products (2013)
  All Vendors                    45%        22%            33%
  Large Vendors                  67%        17%            17%

THE EXPLANATIONS

Rule-based IE
PROs
• Declarative
• Easy to comprehend
• Easy to maintain
• Easy to incorporate domain knowledge
• Easy to debug
CONs
• Heuristic
• Requires tedious manual labor

ML-based IE
PROs
• Trainable
• Adaptable
• Reduces manual effort
CONs
• Requires labeled data
• Requires retraining for domain adaptation
• Requires ML expertise to use or maintain
• Opaque

Academia
Evaluating Benefits of IE
• Evaluating IE on its own
• Precision and Recall
Evaluating Costs of IE
• Labor cost of writing rules

Industry
Evaluating Benefits of IE
• Evaluating IE as part of a larger process
• Using ill-defined metrics that are subject to change
Evaluating Costs of IE
• Labor cost
• Hardware cost
• Business risk
• Others

What’s the research in Rule-based IE?

BRIDGING THE GAP
Where is the research in rule-based IE? Making it more principled, effective, and efficient
Define standard IE rule language and data model.
• What is the right data model to capture text, annotations over text, and their properties?
• Can we establish a standard declarative extensible rule language to solve most IE tasks encountered so far?
Systems research based on standard IE rule language.
• Data representation
• Automatic performance optimization
• Exploring modern hardware …
ML research based on standard IE rule language.
• How to learn basic primitives such as regular expressions and dictionaries?
• How to automatically generate rules that are understandable and maintainable?
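
As a rough illustration of the questions above (not taken from the poster), the Python sketch below assumes a data model in which an annotation is a labeled character span over the text, and builds rules from the two primitives the poster names, regular expressions and dictionaries. All names here (Annotation, regex_rule, dictionary_rule, followed_by) and the sample sentence are hypothetical; this is one possible shape for such a rule layer, not a proposed standard.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical data model: a labeled character span over the text."""
    begin: int
    end: int
    label: str

    def text(self, doc: str) -> str:
        return doc[self.begin:self.end]


def regex_rule(pattern: str, label: str):
    """Primitive: annotate every match of a regular expression."""
    compiled = re.compile(pattern)

    def apply(doc: str):
        return [Annotation(m.start(), m.end(), label)
                for m in compiled.finditer(doc)]
    return apply


def dictionary_rule(terms, label: str):
    """Primitive: annotate every whole-word occurrence of a dictionary term."""
    # Longest terms first so longer entries win over their prefixes.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return regex_rule(r"\b(?:" + alternation + r")\b", label)


def followed_by(left, right, label: str, max_gap: int = 1):
    """Combinator: join a left span with a right span that starts
    at most max_gap characters after the left span ends."""
    def apply(doc: str):
        rights = right(doc)
        return [Annotation(l.begin, r.end, label)
                for l in left(doc)
                for r in rights
                if 0 <= r.begin - l.end <= max_gap]
    return apply


# Assumed example rules: a first-name dictionary followed by a capitalized word.
first_name = dictionary_rule({"Laura", "Frederick"}, "FirstName")
cap_word = regex_rule(r"\b[A-Z][a-z]+\b", "CapWord")
person = followed_by(first_name, cap_word, "Person")

DOC = "Laura Chiticariu and Frederick Reiss work at IBM Research."
people = [a.text(DOC) for a in person(DOC)]
print(people)  # ['Laura Chiticariu', 'Frederick Reiss']
```

Because each rule is a plain function from text to annotations, a wrong result can be traced to the primitive that produced it, which is the "easy to debug" property listed for rule-based IE. Learning the dictionary entries or the regular expression from data would leave the rule structure readable.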
