An In-depth Analysis of the Effect of Text Normalization in Social Media
Tyler Baldwin and Yunyao Li
IBM Research - Almaden

Normalization maps sentences or their tokens to their standard form. It can help downstream applications that expect data in “clean” formats. Early approaches were very application-centric.
Social media normalization has been characterized in the literature as a mapping from non-standard words to their standard forms, which ignores many other operations.
Perfect normalization is difficult, but the typical approach may not be sufficient for all applications.
Should we return to an application-centric approach?

Taxonomy of Normalization Edits
Previous taxonomies do not agree on the appropriate level of granularity to examine.
Using several levels of granularity allows both a big-picture and an in-depth examination.

Methodology
Three applications: syntactic parsing, named-entity recognition, text-to-speech synthesis
- 600 tweets, TREC Twitter data
- Gold standard annotated to ideal form by human annotators
- Edit types examined via ablation

Normalization For Parsing
Normalization For NER
Normalization For TTS
[Figures: bar charts of F-Measure Difference by edit type, one at coarse granularity and one at fine granularity for each of Parsing, NER, and TTS.
Coarse granularity: ADDITION, REPLACEMENT, REMOVAL.
Fine granularity: ADDITION (BEVERB, DETERMINER, OTHER, SUBJECT), REPLACEMENT (CAPITALIZATION, CONTRACTION, OTHER, SLANG), REMOVAL (OTHER, TWITTER).]
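The poster says edit types were examined via ablation and reports F-measure differences. A minimal sketch of that idea follows, assuming the downstream output on the fully normalized text serves as the reference and that one coarse edit type is withheld at a time. The `Edit` class, the `bigrams` stand-in for a downstream tool, and the edit list for the example tweet are all hypothetical illustrations, not the authors' implementation, and punctuation edits are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Set, Tuple


@dataclass(frozen=True)
class Edit:
    kind: str       # coarse edit type: "ADDITION", "REPLACEMENT" or "REMOVAL"
    index: int      # position in the original token list the edit applies to
    text: str = ""  # words to add or substitute (unused for removals)


def apply_edits(tokens: Sequence[str], edits: Sequence[Edit]) -> List[str]:
    """Apply edits to the original tokens; additions go before their index."""
    out: List[str] = []
    for i in range(len(tokens) + 1):
        here = [e for e in edits if e.index == i]
        for e in here:
            if e.kind == "ADDITION":
                out.extend(e.text.split())
        if i == len(tokens):
            break  # past the last token: only trailing additions apply
        if any(e.kind == "REMOVAL" for e in here):
            continue
        replacement = next((e for e in here if e.kind == "REPLACEMENT"), None)
        out.extend(replacement.text.split() if replacement else [tokens[i]])
    return out


def f_measure(predicted: Set, reference: Set) -> float:
    """Balanced F-measure (harmonic mean of precision and recall) over sets."""
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def bigrams(tokens: Sequence[str]) -> Set[Tuple[str, str]]:
    """Toy stand-in for a downstream tool (NOT a parser, NER or TTS system)."""
    return set(zip(tokens, tokens[1:]))


def ablation_differences(
    tokens: Sequence[str], edits: Sequence[Edit], analyze: Callable = bigrams
) -> Dict[str, float]:
    """Withhold one edit type at a time and report the F-measure loss."""
    reference = analyze(apply_edits(tokens, edits))
    differences = {}
    for kind in ("ADDITION", "REPLACEMENT", "REMOVAL"):
        kept = [e for e in edits if e.kind != kind]
        ablated = analyze(apply_edits(tokens, kept))
        differences[kind] = 1.0 - f_measure(ablated, reference)
    return differences


# Hypothetical gold edits for the poster's example tweet (punctuation omitted).
tweet = "@someGuy idk kinda wanna get NEW ipad".split()
gold_edits = [
    Edit("REMOVAL", 0),
    Edit("REPLACEMENT", 1, "I don't know"),
    Edit("ADDITION", 2, "I"),
    Edit("REPLACEMENT", 2, "kind of"),
    Edit("REPLACEMENT", 3, "want to"),
    Edit("ADDITION", 5, "a"),
    Edit("REPLACEMENT", 5, "new"),
    Edit("REPLACEMENT", 6, "iPad"),
]
print(" ".join(apply_edits(tweet, gold_edits)))
print(ablation_differences(tweet, gold_edits))
```

Swapping `bigrams` for a real parser, entity recognizer, or speech front end (and for that tool's own notion of matching output) would give the application-specific view the poster argues for; the numbers from this toy scorer say nothing about the reported results.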
Edit
  Insertion
    Punctuation
    Word: Subj., Be, Det., Other
  Replacement
    Punctuation
    Word: Slang, Cont., Cap., Other
  Removal
    Punctuation
    Word: Twitter, Other
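The taxonomy can be read at several levels of granularity. The sketch below encodes it as a nested structure and shows how a fine-grained edit label might be collapsed to a coarser one. The nesting of Punctuation and Word under each operation is inferred from the flattened diagram, and the names `TAXONOMY`, `leaf_paths`, and `at_granularity` are hypothetical.

```python
from typing import Dict, List, Tuple

# Operation -> unit -> word-level subtypes (empty list: no further split).
# Labels follow the poster; the nesting is an assumption based on its layout.
TAXONOMY: Dict[str, Dict[str, List[str]]] = {
    "Insertion": {"Punctuation": [], "Word": ["Subj.", "Be", "Det.", "Other"]},
    "Replacement": {"Punctuation": [], "Word": ["Slang", "Cont.", "Cap.", "Other"]},
    "Removal": {"Punctuation": [], "Word": ["Twitter", "Other"]},
}


def leaf_paths(taxonomy: Dict[str, Dict[str, List[str]]]) -> List[Tuple[str, ...]]:
    """List every most-specific edit type as a path from the root."""
    paths: List[Tuple[str, ...]] = []
    for operation, units in taxonomy.items():
        for unit, subtypes in units.items():
            if subtypes:
                paths.extend((operation, unit, s) for s in subtypes)
            else:
                paths.append((operation, unit))
    return paths


def at_granularity(path: Tuple[str, ...], depth: int) -> Tuple[str, ...]:
    """Collapse a fine-grained path to its first `depth` levels."""
    return path[:depth]


fine = ("Insertion", "Word", "Det.")
print(at_granularity(fine, 1))  # coarse view: operation only
print(at_granularity(fine, 2))  # intermediate view: operation and unit
print(len(leaf_paths(TAXONOMY)))
```

Keeping each label as a path means the same annotated edit can be counted at the coarse level (operation) or the fine level (word subtype) without re-annotating.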
Parsing is heavily impacted by most normalization operations.
Entity recognition is mostly dependent on replacement edits. Capitalization correction is important as well.
Speech synthesis is heavily impacted by normalization generally, but handling domain-specific terms is critical.
Original: @someGuy idk kinda wanna get NEW ipad
Typical: @someGuy I don't know kind of want to get NEW ipad
Ideal: I don't know, I kind of want to get a new iPad.
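To make the gap between the typical and ideal outputs concrete, here is a minimal sketch of the word-replacement view of normalization, assuming a small hand-written lookup table. The lexicon and the function name `normalize_typical` are hypothetical and stand in for whatever resource a real system would use.

```python
from typing import Dict

# Hypothetical lexicon mapping non-standard words to standard forms.
SLANG_LEXICON: Dict[str, str] = {
    "idk": "I don't know",
    "kinda": "kind of",
    "wanna": "want to",
}


def normalize_typical(tweet: str, lexicon: Dict[str, str] = SLANG_LEXICON) -> str:
    """Replace each known non-standard word; leave every other token as-is."""
    return " ".join(lexicon.get(token.lower(), token) for token in tweet.split())


print(normalize_typical("@someGuy idk kinda wanna get NEW ipad"))
```

A lookup of this kind only replaces words. It leaves the @-mention in place, adds no missing subject or determiner, and does not touch capitalization or punctuation, which are the operations that separate the typical output from the ideal one.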
