Wonderful work done with Congle Zhang (my summer intern in 2012) and my IBM colleagues. Nominated for the best paper award and presented at ACL 2013.
Adaptive Parser-Centric Text Normalization
Congle Zhang, Tyler Baldwin, Howard Ho, Benny Kimelfeld, Yunyao Li
Proceedings of ACL, pp. 1159--1168, 2013
2. Public Text vs. Private Text
[Figure: text sources feeding text analytics and its applications]
• Public text: Web text (social media, news), SEC, USPTO
• Private text: internal data, subscription data
• Applications: marketing, financial investment, drug discovery, law enforcement, …
Text analytics is the key for discovering hidden value from text
7. ay woundent of see ’ em
CAN YOU READ THIS ON THE FIRST ATTEMPT?
I would not have seen them.
8. When a machine reads it
Results from Google Translate
Chinese 唉看见他们 woundent
Spanish ay woundent de verlas
Japanese ローマ法王進呈の AY woundent
Portuguese ay woundent de vê-los
German ay woundent de voir 'em
9. Text Normalization
• Informal writing → standard written form
ay woundent of see ’ em  --normalize-->  I would not have seen them .
10. Challenge: Grammar
Text normalization as mapping out-of-vocabulary non-standard tokens to their in-vocabulary standard form:
ay woundent of see ’ em → would not of see them
would not of see them ≠ I would not have seen them.
12. Challenge: Evaluation
• Previous metrics: word error rate & BLEU score
• However,
  – words are not equally important
  – non-word information (punctuation, capitalization) can be important
  – word reordering is important
• How does normalization actually impact the downstream applications?
16. Model: Boolean Variables
• Associate a unique Boolean variable Xr with each replacement r
  – Xr = true: replacement r is used to produce the output sentence
Example: X<2,3,”would not”> = true produces “… would not …” in the output
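A minimal sketch of the replacement representation described above (the class name and fields are assumptions for illustration, not the authors' code):

```python
# A replacement is a triplet (start, end, phrase) covering input tokens
# [start, end), plus a Boolean variable X_r saying whether it is used.
from dataclasses import dataclass

@dataclass(frozen=True)
class Replacement:
    start: int   # first covered input token (1-based, as on the slides)
    end: int     # one past the last covered token
    phrase: str  # the output text this replacement emits

# X_r = true: replacement r produces part of the output sentence.
r = Replacement(2, 3, "would not")
assignment = {r: True}
```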
17. Model: Normalization Graph
• A graphical model over the input “Ay woudent of see ‘em”
[Figure: normalization graph with nodes *START*, <1,2,”Ay”>, <1,2,”I”>, <2,3,”would”>, <2,4,”would not have”>, <3,4,”of”>, <4,5,”seen”>, <4,6,”see him”>, <5,6,”them”>, <6,6,”.”>, *END*; replacements are connected according to their token positions]
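The wiring of the graph can be sketched as follows (a hypothetical helper, not the authors' code): add an edge r1 → r2 whenever r2 starts at the position where r1 ends. Positions are 1-based and end-exclusive, so <2,4,”would not have”> covers input tokens 2 and 3.

```python
def build_graph(replacements, n_tokens):
    """Connect (start, end, phrase) triplets into a normalization DAG."""
    START, END = "*START*", "*END*"
    edges = {START: [], END: []}
    for r in replacements:
        edges[r] = []
    for r in replacements:
        if r[0] == 1:                      # begins at the first token
            edges[START].append(r)
        if r[1] == n_tokens + 1:           # consumes through the last token
            edges[r].append(END)
        for s in replacements:
            if s is not r and s[0] == r[1]:
                edges[r].append(s)         # s picks up where r leaves off
    return edges

# The graph for "Ay woudent of see 'em" (5 tokens):
rs = [(1, 2, "Ay"), (1, 2, "I"), (2, 3, "would"), (2, 4, "would not have"),
      (3, 4, "of"), (4, 5, "seen"), (4, 6, "see him"), (5, 6, "them"),
      (6, 6, ".")]
graph = build_graph(rs, 5)
```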
18. Model: Legal Assignment
• Soundness
  – no two true replacements overlap
  – <1,2,”Ay”> and <1,2,”I”> cannot both be true
• Completeness
  – every input token is captured by at least one true replacement
19. Model: Legal = Path
• A legal assignment = a path from *START* to *END*
[Figure: the normalization graph with the path *START* → <1,2,”I”> → <2,4,”would not have”> → <4,6,”see him”> → <6,6,”.”> → *END* highlighted]
Output: I would not have see him.
20. Model: Assignment Probability
• Log-linear model; feature functions on edges
[Figure: the normalization graph from slide 17, with feature functions attached to its edges]
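In symbols (reconstructed from the description; the exact notation is an assumption, not quoted from the paper), the probability of a legal assignment x for input s, with edge features f and weight vector θ, is

```latex
P(\mathbf{x} \mid s) \;=\; \frac{1}{Z(s)}\,
  \exp\!\Big(\sum_{e \,\in\, \mathrm{path}(\mathbf{x})}
    \boldsymbol{\theta} \cdot \mathbf{f}(e, s)\Big),
\qquad
Z(s) \;=\; \sum_{\mathbf{x}'} \exp\!\Big(\sum_{e \,\in\, \mathrm{path}(\mathbf{x}')}
    \boldsymbol{\theta} \cdot \mathbf{f}(e, s)\Big)
```

where the sum in the exponent runs over the edges of the start-to-end path that the assignment selects, and Z(s) normalizes over all legal assignments.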
22. Inference
• Select the assignment with the highest probability
• Computationally hard on general graphical models …
• But in our model it boils down to finding the longest path in a weighted directed acyclic graph
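The longest-path reduction can be sketched as follows (a generic DAG longest-path routine under assumed data structures, not the authors' implementation): relax edges in topological order and follow back-pointers.

```python
from collections import defaultdict

def longest_path(edges, weight, start, end):
    """Highest-weight start-to-end path in a weighted DAG.
    edges: node -> list of successors; weight: (u, v) -> edge score."""
    # Topological order = reversed DFS postorder.
    order, seen = [], set()
    def dfs(u):
        if u in seen:
            return
        seen.add(u)
        for v in edges.get(u, []):
            dfs(v)
        order.append(u)
    dfs(start)
    order.reverse()
    # Relax edges in topological order, keeping back-pointers.
    best = defaultdict(lambda: float("-inf"))
    back = {}
    best[start] = 0.0
    for u in order:
        for v in edges.get(u, []):
            if best[u] + weight[(u, v)] > best[v]:
                best[v] = best[u] + weight[(u, v)]
                back[v] = u
    # Reconstruct the best path from the back-pointers.
    path, node = [end], end
    while node != start:
        node = back[node]
        path.append(node)
    return path[::-1]
```

Because every legal assignment is a start-to-end path, the maximum-probability assignment is exactly the path this routine returns when edge weights are the (log-linear) edge scores.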
23. Inference
• Weighted longest path
[Figure: the normalization graph with the highest-weight path *START* → <1,2,”I”> → <2,4,”would not have”> → <4,6,”see him”> → <6,6,”.”> → *END* highlighted]
Output: I would not have see him.
25. Learning
• Perceptron-style algorithm
  – update the weights by comparing (1) the most probable output under the current weights with (2) the gold sequence
Input: (1) informal text: Ay woudent of see ‘em; (2) gold: I would not have seen them.; (3) the normalization graph
Output: the weights of the features
26. Learning: Gold vs. Inferred
[Figure: the normalization graph with two paths marked: the gold sequence, and the most probable sequence under the current weights θ]
27. Learning: Update Weights on the Differential Edges
[Figure: the normalization graph; weights wi are increased on edges that appear only in the gold path and decreased on edges that appear only in the inferred path, so the gold sequence becomes “longer”]
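A minimal sketch of the perceptron-style update on the differential edges (the feature encoding here is an assumption for illustration): raise the weights of features on edges that appear only in the gold path and lower them on edges that appear only in the inferred path, so that the gold path grows "longer" under the model.

```python
def perceptron_update(theta, gold_edges, inferred_edges, features, lr=1.0):
    """Update weight dict theta in place from one training example.
    features(e) returns a {feature_name: value} dict for edge e."""
    for e in gold_edges - inferred_edges:      # edges only in the gold path
        for name, value in features(e).items():
            theta[name] = theta.get(name, 0.0) + lr * value
    for e in inferred_edges - gold_edges:      # edges only in the inferred path
        for name, value in features(e).items():
            theta[name] = theta.get(name, 0.0) - lr * value
    return theta

# Toy example: one indicator feature per edge (hypothetical edge names).
theta = perceptron_update(
    theta={}, gold_edges={"g1", "shared"}, inferred_edges={"i1", "shared"},
    features=lambda e: {"lineage:" + e: 1.0})
# Edges shared by both paths are left untouched.
```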
29. Instantiation: Replacement Generators
Generator               From       To
leave intact            good       good
edit distance           bac        back
lowercase               NEED       need
capitalize              it         It
Google spell            dispaear   disappear
contraction             wouldn’t   would not
slang language          ima        I am going to
insert punctuation      ε          .
duplicated punctuation  !?         !
delete filler           lmao       ε
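Two of the generators above can be sketched like this (function names and signatures are assumptions; each maps the input tokens to (start, end, phrase) triplets, 1-based and end-exclusive):

```python
def leave_intact(tokens):
    # Propose every token unchanged, so the graph always has a path.
    return [(i, i + 1, tok) for i, tok in enumerate(tokens, start=1)]

def lowercase(tokens):
    # Propose a lowercased variant for all-caps tokens, e.g. NEED -> need.
    return [(i, i + 1, tok.lower())
            for i, tok in enumerate(tokens, start=1)
            if len(tok) > 1 and tok.isupper()]

tokens = "I NEED help".split()
# leave_intact(tokens) keeps all three tokens;
# lowercase(tokens) additionally proposes (2, 3, "need").
```

Domain customization then amounts to running the generic generators together with any domain-specific ones and pooling the resulting triplets.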
30. Instantiation: Features
• N-gram
  – frequency of the phrases induced by an edge
• Part-of-speech
  – encourage certain behaviors, such as avoiding the deletion of noun phrases
• Positional
  – capitalize words after sentence-ending punctuation
• Lineage
  – which generator spawned the replacement
32. Evaluation Metrics: Compare Parses
[Figure: the input sentence is normalized by a human expert into a gold sentence and by the normalizer into a normalized sentence; both are run through a parser, and the gold parse is compared against the normalized parse]
Focus on subjects, verbs, and objects (SVO)
33. Evaluation Metrics: Example
Test: I kinda wanna get ipad NEW
Gold: I kind of want to get a new iPad.
Test SVO: verb(get); subj(get,I), subj(get,wanna), obj(get,NEW)
Gold SVO: verb(want), verb(get); subj(want,I), subj(get,I), obj(get,iPad)
precision_v = 1/1, recall_v = 1/2
precision_so = 1/3, recall_so = 1/3
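The metric reduces to set precision/recall over relation tuples; computing the slide's example (the tuple encoding is an assumed representation):

```python
def prf(test, gold):
    """Precision and recall of the test relation set against the gold set."""
    correct = len(test & gold)
    return correct / len(test), correct / len(gold)

# Relations extracted from the test and gold parses in the example above.
test_v = {("verb", "get")}
gold_v = {("verb", "want"), ("verb", "get")}
test_so = {("subj", "get", "I"), ("subj", "get", "wanna"), ("obj", "get", "NEW")}
gold_so = {("subj", "want", "I"), ("subj", "get", "I"), ("obj", "get", "iPad")}

prf(test_v, gold_v)    # -> (1.0, 0.5): precision_v = 1/1, recall_v = 1/2
prf(test_so, gold_so)  # -> (1/3, 1/3): only subj(get,I) matches
```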
34. Evaluation: Baselines
• w/oN: without normalization
• Google: Google spell checker
• w2wN: word-to-word normalization [Han and
Baldwin 2011]
• Gw2wN: gold-standard word-to-word normalizations from previous work (whenever available)
35. Evaluation: Domains
• Twitter [Han and Baldwin 2011]
– Gold: Grammatical sentences
• SMS [Choudhury et al. 2007]
  – Gold: Grammatical sentences
• Call-Center Log: proprietary
  – Text-based responses about users’ experience with a call center for a major company
  – Gold: Grammatical sentences
36. Evaluation: Twitter
• Twitter-specific replacement generators
  – hashtags (#), at-mentions (@), and retweets (RT)
  – generators that allow either the initial symbol or the entire token to be deleted
37. Evaluation: Twitter
System           Verb (Pre / Rec / F1)   Subject-Object (Pre / Rec / F1)
w/oN             83.7 / 68.1 / 75.1      31.7 / 38.6 / 34.8
Google           88.9 / 78.8 / 83.5      36.1 / 46.3 / 40.6
w2wN             87.5 / 81.5 / 84.4      44.5 / 58.9 / 50.7
Gw2wN            89.8 / 83.8 / 86.7      46.9 / 61.0 / 53.0
generic          91.7 / 88.9 / 90.3      53.6 / 70.2 / 60.8
domain-specific  95.3 / 88.7 / 91.9      72.5 / 76.3 / 74.4
Domain-specific generators yielded the best overall performance
38. Evaluation: Twitter
System           Verb (Pre / Rec / F1)   Subject-Object (Pre / Rec / F1)
w/oN             83.7 / 68.1 / 75.1      31.7 / 38.6 / 34.8
Google           88.9 / 78.8 / 83.5      36.1 / 46.3 / 40.6
w2wN             87.5 / 81.5 / 84.4      44.5 / 58.9 / 50.7
Gw2wN            89.8 / 83.8 / 86.7      46.9 / 61.0 / 53.0
generic          91.7 / 88.9 / 90.3      53.6 / 70.2 / 60.8
domain-specific  95.3 / 88.7 / 91.9      72.5 / 76.3 / 74.4
Even without domain-specific generators, our system outperformed the word-to-word normalization approaches
39. Evaluation: Twitter
System           Verb (Pre / Rec / F1)   Subject-Object (Pre / Rec / F1)
w/oN             83.7 / 68.1 / 75.1      31.7 / 38.6 / 34.8
Google           88.9 / 78.8 / 83.5      36.1 / 46.3 / 40.6
w2wN             87.5 / 81.5 / 84.4      44.5 / 58.9 / 50.7
Gw2wN            89.8 / 83.8 / 86.7      46.9 / 61.0 / 53.0
generic          91.7 / 88.9 / 90.3      53.6 / 70.2 / 60.8
domain-specific  95.3 / 88.7 / 91.9      72.5 / 76.3 / 74.4
Even perfect word-to-word normalization is not good enough!
43. Evaluation: Call-Center
System           Verb (Pre / Rec / F1)   Subject-Object (Pre / Rec / F1)
w/oN             98.5 / 97.1 / 97.8      69.2 / 66.1 / 67.6
Google           99.2 / 97.9 / 98.5      70.5 / 67.3 / 68.8
generic          98.9 / 97.4 / 98.1      71.3 / 67.9 / 69.6
domain-specific  99.2 / 97.4 / 98.3      87.9 / 83.1 / 85.4
44. Discussion
• Domain transfer with a small amount of effort is possible
• Performing normalization is indeed beneficial to dependency parsing
  – simple word-to-word normalization is not enough
45. Conclusion
• Normalization framework with an eye toward
domain adaptation
• Parser-centric view of normalization
• Our system outperformed competitive baselines across three different domains
• Dataset to spur future research
– https://www.cs.washington.edu/node/9091/
Much of the big data in text form is bad data that is difficult to analyze, even for human beings.
The average reading speed for English is 250 words per minute. With this short sentence of only 5 tokens, one should need no more than 2 seconds.
None of the translations really makes much sense!
While there is a substantial body of previous work on text normalization, in this work we seek to address several new challenges.
Why fully grammatical?
Most NLP algorithms are trained on news articles, such as those from the WSJ and the NYT.
A replacement generator is a function that takes a sequence of tokens as input and generates one or more replacements.
Each replacement is in the form of a triplet.
Domain customization is done through a combination of generic replacements and domain-specific replacements.
By connecting replacements with each other based on their token positions, we can construct a directed acyclic graph.
The output of normalization can only be produced by a legal assignment, where a legal assignment must be both sound and complete.
Essentially, each legal assignment corresponds to a path from start to end.
We appeal to the log-linear model formulation to define the probability of an assignment.
The probability of an assignment depends on the input as well as the weight vector of the features.
When performing inference, we wish to select the output sequence with the highest probability.
The goal of learning is to compute the weights of our features. We use a perceptron-style algorithm to do the learning. The idea is to update the weights over iterations to minimize the difference between the true path and the inferred path.
Here is a simple demo of one iteration of learning. From the gold standard, we know the black path is the true path. But with the current weights, inference tells us the blue path is the best path.
We want the inferred path to move toward the true path. So the natural thing to do is to decrease the weights in the blue boxes, because they appear only in the inferred path, and to increase the weights in the purple boxes, because they appear only in the true path. This update makes the true path longer in the model, so it will be picked by our algorithm.
We use features from four major sources.
N-gram features indicate the frequency of the phrases induced by an edge.
POS information can be used to produce features that encourage certain behavior, such as …
Information from positions is used primarily to handle capitalization and punctuation insertion.
Finally, we include binary features that indicate which generator spawned the replacement.
We propose a new evaluation metric that directly equates normalization performance with the performance of a common downstream application: dependency parsing.