Wonderful work done with Congle Zhang (my summer intern in 2012) and my IBM colleagues. Nominated for the best paper award and presented at ACL 2013.
Adaptive Parser-Centric Text Normalization
Congle Zhang, Tyler Baldwin, Howard Ho, Benny Kimelfeld, Yunyao Li
Proceedings of ACL, pp. 1159--1168, 2013
2. Public Text vs. Private Text
[Figure: text sources feeding text analytics and its applications]
• Public text: Web text (social media, news), SEC, USPTO
• Private text: internal data, subscription data
• Applications: marketing, financial investment, drug discovery, law enforcement, …
Text analytics is the key for discovering hidden value from text
7. ay woundent of see ’ em
CAN YOU READ THIS ON THE FIRST ATTEMPT?
I would not have seen them.
8. When a machine reads it
Results from Google Translate
Chinese 唉看见他们 woundent
Spanish ay woundent de verlas
Japanese ローマ法王進呈の AY woundent
Portuguese ay woundent de vê-los
German ay woundent de voir 'em
9. Text Normalization
• Informal writing → standard written form
ay woundent of see ’ em  --normalize-->  I would not have seen them .
10. Challenge: Grammar
Text normalization as mapping out-of-vocabulary non-standard tokens to their in-vocabulary standard form:
ay woundent of see ’ em → would not of see them
would not of see them ≠ I would not have seen them.
12. Challenge: Evaluation
• Previous metrics: word error rate & BLEU score
• However,
  – words are not equally important
  – non-word information (punctuation, capitalization) can be important
  – word reordering is important
• How does normalization actually impact the downstream applications?
16. Model: Boolean Variables
• Associate a unique Boolean variable Xr with each replacement r
  – Xr = true: replacement r is used to produce the output sentence
Example: X<2,3,”would not”> = true produces “… would not …” in the output
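A minimal sketch of the replacement representation described above (the class name and fields are assumptions for illustration, not the authors' code):

```python
# A replacement is a triplet (start, end, phrase) covering input tokens
# [start, end), plus a Boolean variable X_r saying whether it is used.
from dataclasses import dataclass

@dataclass(frozen=True)
class Replacement:
    start: int   # first covered input token (1-based, as on the slides)
    end: int     # one past the last covered token
    phrase: str  # the output text this replacement emits

# X_r = true: replacement r produces part of the output sentence.
r = Replacement(2, 3, "would not")
assignment = {r: True}
```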
17. Model: Normalization Graph
• A graphical model over the input “Ay woudent of see ‘em”
[Figure: normalization graph with nodes *START*, <1,2,”Ay”>, <1,2,”I”>, <2,3,”would”>, <2,4,”would not have”>, <3,4,”of”>, <4,5,”seen”>, <4,6,”see him”>, <5,6,”them”>, <6,6,”.”>, *END*; replacements are connected according to their token positions]
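The wiring of the graph can be sketched as follows (a hypothetical helper, not the authors' code): add an edge r1 → r2 whenever r2 starts at the position where r1 ends. Positions are 1-based and end-exclusive, so <2,4,”would not have”> covers input tokens 2 and 3.

```python
def build_graph(replacements, n_tokens):
    """Connect (start, end, phrase) triplets into a normalization DAG."""
    START, END = "*START*", "*END*"
    edges = {START: [], END: []}
    for r in replacements:
        edges[r] = []
    for r in replacements:
        if r[0] == 1:                      # begins at the first token
            edges[START].append(r)
        if r[1] == n_tokens + 1:           # consumes through the last token
            edges[r].append(END)
        for s in replacements:
            if s is not r and s[0] == r[1]:
                edges[r].append(s)         # s picks up where r leaves off
    return edges

# The graph for "Ay woudent of see 'em" (5 tokens):
rs = [(1, 2, "Ay"), (1, 2, "I"), (2, 3, "would"), (2, 4, "would not have"),
      (3, 4, "of"), (4, 5, "seen"), (4, 6, "see him"), (5, 6, "them"),
      (6, 6, ".")]
graph = build_graph(rs, 5)
```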
18. Model: Legal Assignment
• Soundness
  – no two true replacements overlap
  – <1,2,”Ay”> and <1,2,”I”> cannot both be true
• Completeness
  – every input token is captured by at least one true replacement
19. Model: Legal = Path
• A legal assignment = a path from *START* to *END*
[Figure: the normalization graph with the path *START* → <1,2,”I”> → <2,4,”would not have”> → <4,6,”see him”> → <6,6,”.”> → *END* highlighted]
Output: I would not have see him.
20. Model: Assignment Probability
• Log-linear model; feature functions on edges
[Figure: the normalization graph from slide 17, with feature functions attached to its edges]
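In symbols (reconstructed from the description; the exact notation is an assumption, not quoted from the paper), the probability of a legal assignment x for input s, with edge features f and weight vector θ, is

```latex
P(\mathbf{x} \mid s) \;=\; \frac{1}{Z(s)}\,
  \exp\!\Big(\sum_{e \,\in\, \mathrm{path}(\mathbf{x})}
    \boldsymbol{\theta} \cdot \mathbf{f}(e, s)\Big),
\qquad
Z(s) \;=\; \sum_{\mathbf{x}'} \exp\!\Big(\sum_{e \,\in\, \mathrm{path}(\mathbf{x}')}
    \boldsymbol{\theta} \cdot \mathbf{f}(e, s)\Big)
```

where the sum in the exponent runs over the edges of the start-to-end path that the assignment selects, and Z(s) normalizes over all legal assignments.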
22. Inference
• Select the assignment with the highest probability
• Computationally hard on general graphical models …
• But in our model it boils down to finding the longest path in a weighted directed acyclic graph
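The longest-path reduction can be sketched as follows (a generic DAG longest-path routine under assumed data structures, not the authors' implementation): relax edges in topological order and follow back-pointers.

```python
from collections import defaultdict

def longest_path(edges, weight, start, end):
    """Highest-weight start-to-end path in a weighted DAG.
    edges: node -> list of successors; weight: (u, v) -> edge score."""
    # Topological order = reversed DFS postorder.
    order, seen = [], set()
    def dfs(u):
        if u in seen:
            return
        seen.add(u)
        for v in edges.get(u, []):
            dfs(v)
        order.append(u)
    dfs(start)
    order.reverse()
    # Relax edges in topological order, keeping back-pointers.
    best = defaultdict(lambda: float("-inf"))
    back = {}
    best[start] = 0.0
    for u in order:
        for v in edges.get(u, []):
            if best[u] + weight[(u, v)] > best[v]:
                best[v] = best[u] + weight[(u, v)]
                back[v] = u
    # Reconstruct the best path from the back-pointers.
    path, node = [end], end
    while node != start:
        node = back[node]
        path.append(node)
    return path[::-1]
```

Because every legal assignment is a start-to-end path, the maximum-probability assignment is exactly the path this routine returns when edge weights are the (log-linear) edge scores.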
23. Inference
• Weighted longest path
[Figure: the normalization graph with the highest-weight path *START* → <1,2,”I”> → <2,4,”would not have”> → <4,6,”see him”> → <6,6,”.”> → *END* highlighted]
Output: I would not have see him.
25. Learning
• Perceptron-style algorithm
  – update the weights by comparing (1) the most probable output under the current weights with (2) the gold sequence
Input: (1) informal text: Ay woudent of see ‘em; (2) gold: I would not have seen them.; (3) the normalization graph
Output: the weights of the features
26. Learning: Gold vs. Inferred
[Figure: the normalization graph with two paths marked: the gold sequence, and the most probable sequence under the current weights θ]
27. Learning: Update Weights on the Differential Edges
[Figure: the normalization graph; weights wi are increased on edges that appear only in the gold path and decreased on edges that appear only in the inferred path, so the gold sequence becomes “longer”]
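A minimal sketch of the perceptron-style update on the differential edges (the feature encoding here is an assumption for illustration): raise the weights of features on edges that appear only in the gold path and lower them on edges that appear only in the inferred path, so that the gold path grows "longer" under the model.

```python
def perceptron_update(theta, gold_edges, inferred_edges, features, lr=1.0):
    """Update weight dict theta in place from one training example.
    features(e) returns a {feature_name: value} dict for edge e."""
    for e in gold_edges - inferred_edges:      # edges only in the gold path
        for name, value in features(e).items():
            theta[name] = theta.get(name, 0.0) + lr * value
    for e in inferred_edges - gold_edges:      # edges only in the inferred path
        for name, value in features(e).items():
            theta[name] = theta.get(name, 0.0) - lr * value
    return theta

# Toy example: one indicator feature per edge (hypothetical edge names).
theta = perceptron_update(
    theta={}, gold_edges={"g1", "shared"}, inferred_edges={"i1", "shared"},
    features=lambda e: {"lineage:" + e: 1.0})
# Edges shared by both paths are left untouched.
```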
29. Instantiation: Replacement Generators
Generator               From       To
leave intact            good       good
edit distance           bac        back
lowercase               NEED       need
capitalize              it         It
Google spell            dispaear   disappear
contraction             wouldn’t   would not
slang language          ima        I am going to
insert punctuation      ε          .
duplicated punctuation  !?         !
delete filler           lmao       ε
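Two of the generators above can be sketched like this (function names and signatures are assumptions; each maps the input tokens to (start, end, phrase) triplets, 1-based and end-exclusive):

```python
def leave_intact(tokens):
    # Propose every token unchanged, so the graph always has a path.
    return [(i, i + 1, tok) for i, tok in enumerate(tokens, start=1)]

def lowercase(tokens):
    # Propose a lowercased variant for all-caps tokens, e.g. NEED -> need.
    return [(i, i + 1, tok.lower())
            for i, tok in enumerate(tokens, start=1)
            if len(tok) > 1 and tok.isupper()]

tokens = "I NEED help".split()
# leave_intact(tokens) keeps all three tokens;
# lowercase(tokens) additionally proposes (2, 3, "need").
```

Domain customization then amounts to running the generic generators together with any domain-specific ones and pooling the resulting triplets.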
30. Instantiation: Features
• N-gram
  – frequency of the phrases induced by an edge
• Part-of-speech
  – encourage certain behaviors, such as avoiding the deletion of noun phrases
• Positional
  – capitalize words after sentence-ending punctuation
• Lineage
  – which generator spawned the replacement
32. Evaluation Metrics: Compare Parses
[Figure: the input sentence is normalized by a human expert into a gold sentence and by the normalizer into a normalized sentence; both are run through a parser, and the gold parse is compared against the normalized parse]
Focus on subjects, verbs, and objects (SVO)
33. Evaluation Metrics: Example
Test: I kinda wanna get ipad NEW
Gold: I kind of want to get a new iPad.
Test SVO: verb(get); subj(get,I), subj(get,wanna), obj(get,NEW)
Gold SVO: verb(want), verb(get); subj(want,I), subj(get,I), obj(get,iPad)
precision_v = 1/1, recall_v = 1/2
precision_so = 1/3, recall_so = 1/3
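The metric reduces to set precision/recall over relation tuples; computing the slide's example (the tuple encoding is an assumed representation):

```python
def prf(test, gold):
    """Precision and recall of the test relation set against the gold set."""
    correct = len(test & gold)
    return correct / len(test), correct / len(gold)

# Relations extracted from the test and gold parses in the example above.
test_v = {("verb", "get")}
gold_v = {("verb", "want"), ("verb", "get")}
test_so = {("subj", "get", "I"), ("subj", "get", "wanna"), ("obj", "get", "NEW")}
gold_so = {("subj", "want", "I"), ("subj", "get", "I"), ("obj", "get", "iPad")}

prf(test_v, gold_v)    # -> (1.0, 0.5): precision_v = 1/1, recall_v = 1/2
prf(test_so, gold_so)  # -> (1/3, 1/3): only subj(get,I) matches
```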
34. Evaluation: Baselines
• w/oN: without normalization
• Google: Google spell checker
• w2wN: word-to-word normalization [Han and
Baldwin 2011]
• Gw2wN: gold-standard word-to-word normalizations from previous work (whenever available)
35. Evaluation: Domains
• Twitter [Han and Baldwin 2011]
– Gold: Grammatical sentences
• SMS [Choudhury et al. 2007]
  – Gold: Grammatical sentences
• Call-Center Log: proprietary
  – Text-based responses about users’ experience with a call center for a major company
  – Gold: Grammatical sentences
36. Evaluation: Twitter
• Twitter-specific replacement generators
  – hashtags (#), at-mentions (@), and retweets (RT)
  – generators that allow either the initial symbol or the entire token to be deleted
37. Evaluation: Twitter
System           Verb (Pre / Rec / F1)   Subject-Object (Pre / Rec / F1)
w/oN             83.7 / 68.1 / 75.1      31.7 / 38.6 / 34.8
Google           88.9 / 78.8 / 83.5      36.1 / 46.3 / 40.6
w2wN             87.5 / 81.5 / 84.4      44.5 / 58.9 / 50.7
Gw2wN            89.8 / 83.8 / 86.7      46.9 / 61.0 / 53.0
generic          91.7 / 88.9 / 90.3      53.6 / 70.2 / 60.8
domain-specific  95.3 / 88.7 / 91.9      72.5 / 76.3 / 74.4
Domain-specific generators yielded the best overall performance
38. Evaluation: Twitter
System           Verb (Pre / Rec / F1)   Subject-Object (Pre / Rec / F1)
w/oN             83.7 / 68.1 / 75.1      31.7 / 38.6 / 34.8
Google           88.9 / 78.8 / 83.5      36.1 / 46.3 / 40.6
w2wN             87.5 / 81.5 / 84.4      44.5 / 58.9 / 50.7
Gw2wN            89.8 / 83.8 / 86.7      46.9 / 61.0 / 53.0
generic          91.7 / 88.9 / 90.3      53.6 / 70.2 / 60.8
domain-specific  95.3 / 88.7 / 91.9      72.5 / 76.3 / 74.4
Even without domain-specific generators, our system outperformed the word-to-word normalization approaches
39. Evaluation: Twitter
System           Verb (Pre / Rec / F1)   Subject-Object (Pre / Rec / F1)
w/oN             83.7 / 68.1 / 75.1      31.7 / 38.6 / 34.8
Google           88.9 / 78.8 / 83.5      36.1 / 46.3 / 40.6
w2wN             87.5 / 81.5 / 84.4      44.5 / 58.9 / 50.7
Gw2wN            89.8 / 83.8 / 86.7      46.9 / 61.0 / 53.0
generic          91.7 / 88.9 / 90.3      53.6 / 70.2 / 60.8
domain-specific  95.3 / 88.7 / 91.9      72.5 / 76.3 / 74.4
Even perfect word-to-word normalization is not good enough!
43. Evaluation: Call-Center
System           Verb (Pre / Rec / F1)   Subject-Object (Pre / Rec / F1)
w/oN             98.5 / 97.1 / 97.8      69.2 / 66.1 / 67.6
Google           99.2 / 97.9 / 98.5      70.5 / 67.3 / 68.8
generic          98.9 / 97.4 / 98.1      71.3 / 67.9 / 69.6
domain-specific  99.2 / 97.4 / 98.3      87.9 / 83.1 / 85.4
44. Discussion
• Domain transfer with a small amount of effort is possible
• Performing normalization is indeed beneficial to dependency parsing
  – simple word-to-word normalization is not enough
45. Conclusion
• Normalization framework with an eye toward
domain adaptation
• Parser-centric view of normalization
• Our system outperformed competitive baselines across three different domains
• Dataset to spur future research
– https://www.cs.washington.edu/node/9091/
Much of the big data in text form is bad data that is difficult to analyze, even for human beings.
The average reading speed for English is 250 words per minute. With this short sentence of only 5 tokens, one should need no more than 2 seconds.
None of the translations really makes much sense!
While there is a substantial body of previous work on text normalization, in this work we seek to address several new challenges.
Why fully grammatical?
Most NLP algorithms are trained on news articles, such as those from the WSJ and the NYT.
A replacement generator is a function that takes a sequence of tokens as input and generates one or more replacements.
Each replacement is in the form of a triplet.
Domain customization is done through a combination of generic replacements and domain-specific replacements.
By connecting replacements with each other based on their token positions, we can construct a directed acyclic graph.
The output of normalization can only be produced by a legal assignment, where a legal assignment must be both sound and complete.
Essentially, each legal assignment corresponds to a path from start to end.
We appeal to the log-linear model formulation to define the probability of an assignment.
The probability of an assignment depends on the input as well as the weight vector of the features.
When performing inference, we wish to select the output sequence with the highest probability.
The goal of learning is to compute the weights of our features. We use a perceptron-style algorithm to do the learning. The idea is to update the weights over iterations to minimize the difference between the true path and the inferred path.
Here is a simple demo of one iteration of learning. From the gold standard, we know the black path is the true path. But with the current weights, inference tells us the blue path is the best path.
We want the inferred path to move toward the true path. So the natural thing to do is to decrease the weights in the blue boxes, because they appear only in the inferred path, and to increase the weights in the purple boxes, because they appear only in the true path. This update makes the true path longer in the model, so it will be picked by our algorithm.
We use features from four major sources.
N-gram features indicate the frequency of the phrases induced by an edge.
POS information can be used to produce features that encourage certain behavior, such as …
Information from positions is used primarily to handle capitalization and punctuation insertion.
Finally, we include binary features that indicate which generator spawned the replacement.
We propose a new evaluation metric that directly equates normalization performance with the performance of a common downstream application: dependency parsing.