I2B2 2006 Pedersen
1. November 10, 2006 I2B2 - Smoker Status Challenge 1
Determining Smoker Status using
Supervised and Unsupervised
Learning with Lexical Features
Ted Pedersen
University of Minnesota, Duluth
tpederse@d.umn.edu
http://www.d.umn.edu/~tpederse
2. November 10, 2006 I2B2 - Smoker Status Challenge 2
Approaches
• Smoking Status as Text Classification
– supervised learning
– lexical features
– techniques used to good effect in word sense
disambiguation
• Smoking Status as Text Clustering
– unsupervised learning
– lexical features
– techniques used to good effect in word sense
discrimination
3. November 10, 2006 I2B2 - Smoker Status Challenge 3
Objectives
• How well do WSD techniques generalize to
related but different problems?
– smoking status as "meaning" of record??
– not quite the same problem…
• How well do WSD features generalize?
– bag of words, unigrams
– bigrams
– collocations
• How well do learning algorithms generalize?
– supervised and unsupervised
4. November 10, 2006 I2B2 - Smoker Status Challenge 4
Experimental Variations
Supervised Learning
• Learning Algorithm
– naïve Bayesian classifier
– J48 decision tree
– support vector machine (SMO)
• Feature Sets (also used in unsupervised)
– unigrams, bigrams, trigrams
– various frequency and measure of association cutoffs
– Stop List of 472 words
• 392 function words
• 80 words that occurred in more than half the records
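The feature-selection step above can be sketched roughly as follows. This is a minimal illustration, not the SenseTools implementation: the tiny stop list, the tokenizer, and the function names are hypothetical, and dropping any n-gram that contains a stop word is an assumption about how the stop list was applied.

```python
from collections import Counter

# Hypothetical miniature stop list; the list described on the slide has 472 entries.
STOP_WORDS = {"the", "a", "of", "and", "he", "she", "is", "in"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:()\"'") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def ngrams(tokens, n):
    """Return consecutive n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def select_features(records, n=1, min_count=5, stop_words=STOP_WORDS):
    """Keep n-grams that contain no stop word and occur at least min_count times."""
    counts = Counter()
    for text in records:
        for gram in ngrams(tokenize(text), n):
            # Assumption: an n-gram is dropped if any of its words is a stop word.
            if not any(word in stop_words for word in gram):
                counts[gram] += 1
    return {gram for gram, count in counts.items() if count >= min_count}
```

Calling `select_features(records, n=1, min_count=5)` would correspond to the "unigrams that occurred 5 or more times" setting; `n=2` or `n=3` gives the bigram and trigram variants.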
5. November 10, 2006 I2B2 - Smoker Status Challenge 5
Decision Tree
• J48 most accurate when using unigram
features that occurred 5 or more times in
the training data
– over 3,600 unigrams as candidate features
– decision tree has 47 nodes and 24 leaves
– accuracy of 82% (327/398)
6. November 10, 2006 I2B2 - Smoker Status Challenge 6
Decision Tree
unigrams : 5 or more times
7. November 10, 2006 I2B2 - Smoker Status Challenge 7
82% accuracy (327/398)
10-fold cross validation on train
a b c d e <-- classified as
20 5 1 7 3 | a = PAST-SMOKER
8 46 3 8 1 | b = NON-SMOKER
8 2 240 2 0 | c = UNKNOWN
7 5 1 21 1 | d = CURR-SMOKER
1 3 1 4 0 | e = SMOKER
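As a check on the reported figure, accuracy can be recomputed from the confusion matrix above: the diagonal holds the correctly classified records and the sum of all cells is the number of records. The helper names below are illustrative only.

```python
# Rows are the true classes, columns the predicted classes, in the order shown above.
LABELS = ["PAST-SMOKER", "NON-SMOKER", "UNKNOWN", "CURR-SMOKER", "SMOKER"]
MATRIX = [
    [20, 5, 1, 7, 3],
    [8, 46, 3, 8, 1],
    [8, 2, 240, 2, 0],
    [7, 5, 1, 21, 1],
    [1, 3, 1, 4, 0],
]


def accuracy_counts(matrix):
    """Return (correct, total): the diagonal sum and the sum of all cells."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total


def per_class_recall(matrix, labels):
    """Fraction of each true class that was labeled correctly."""
    recall = {}
    for i, (label, row) in enumerate(zip(labels, matrix)):
        row_total = sum(row)
        recall[label] = row[i] / row_total if row_total else 0.0
    return recall


correct, total = accuracy_counts(MATRIX)
print(f"{correct}/{total} = {correct / total:.1%}")
```

The per-class view makes the weak spot visible: none of the 9 SMOKER records is classified correctly, while UNKNOWN accounts for most of the correct decisions.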
8. November 10, 2006 I2B2 - Smoker Status Challenge 8
Manual Inspection
• From the decision tree learned from the
3,600 features, we decided to use the
following in a second experiment:
– cigarette, drinks, quit, smoke, smoked,
smoker, smokes, smoking, tobacco
10. November 10, 2006 I2B2 - Smoker Status Challenge 10
9-feature Decision Tree
87% accuracy (345/398)
10-fold cross validation on train
a b c d e <-- classified as
20 5 1 10 0 | a = PAST-SMOKER
0 51 2 13 0 | b = NON-SMOKER
0 1 250 1 0 | c = UNKNOWN
5 4 2 24 0 | d = CURR-SMOKER
0 3 1 5 0 | e = SMOKER
11. November 10, 2006 I2B2 - Smoker Status Challenge 11
9-feature Decision Tree
82% accuracy (85/104)
evaluation data
a b c d e <-- classified as
62 0 1 0 0 | a = UNKNOWN
1 10 1 0 4 | b = NON-SMOKER
0 2 4 0 5 | c = PAST-SMOKER
0 0 0 0 3 | d = SMOKER
0 1 1 0 9 | e = CURR-SMOKER
12. November 10, 2006 I2B2 - Smoker Status Challenge 12
9-feature Decision Tree
90% accuracy (94/104)
evaluation data
a b f <-- classified as
62 0 1 | a = UNKNOWN
1 10 5 | b = NON-SMOKER
0 3 22 | f = ALL-SMOKER
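The 3-class table above follows from the 5-class evaluation matrix by merging PAST-SMOKER, SMOKER, and CURR-SMOKER into a single ALL-SMOKER class, so that confusions among the three smoker classes count as correct. A small sketch of that merge, with hypothetical helper names:

```python
LABELS = ["UNKNOWN", "NON-SMOKER", "PAST-SMOKER", "SMOKER", "CURR-SMOKER"]
MATRIX = [
    [62, 0, 1, 0, 0],
    [1, 10, 1, 0, 4],
    [0, 2, 4, 0, 5],
    [0, 0, 0, 0, 3],
    [0, 1, 1, 0, 9],
]
MERGE = {
    "PAST-SMOKER": "ALL-SMOKER",
    "SMOKER": "ALL-SMOKER",
    "CURR-SMOKER": "ALL-SMOKER",
}


def collapse(matrix, labels, merge):
    """Sum the rows and columns of classes that map to the same merged label."""
    new_labels = []
    for label in labels:
        target = merge.get(label, label)
        if target not in new_labels:
            new_labels.append(target)
    index = {label: new_labels.index(merge.get(label, label)) for label in labels}
    size = len(new_labels)
    collapsed = [[0] * size for _ in range(size)]
    for i, row_label in enumerate(labels):
        for j, col_label in enumerate(labels):
            collapsed[index[row_label]][index[col_label]] += matrix[i][j]
    return new_labels, collapsed


new_labels, collapsed = collapse(MATRIX, LABELS, MERGE)
for label, row in zip(new_labels, collapsed):
    print(row, label)
```

The same function applies unchanged to the 5-class matrices from the unsupervised runs.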
13. November 10, 2006 I2B2 - Smoker Status Challenge 13
Unsupervised Experiments
• Bigram Features
– allow up to 5 intervening words
– occur 2 or more times in training data
– limit to those that include "smok" --> 96 features
– social smoking, pack smoking, smoking alcohol,
smoking family, smoke drink, cigarette smoking,
allergies smoking, allergies smoked, smoking quit,
quit smoking, smoker drinks, former smoker, social
smoke, denies smoking, habits smoking, ...
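One way to read the bigram definition above is as ordered word pairs separated by at most 5 intervening tokens, kept only if they recur and one of the two words contains "smok". The sketch below follows that reading; it is not the tool actually used, the function names are hypothetical, and it assumes the records are already tokenized (and, if desired, stop-listed).

```python
from collections import Counter


def windowed_bigrams(tokens, max_gap=5):
    """Ordered pairs (w_i, w_j), i < j, with at most max_gap tokens between them."""
    pairs = []
    for i, first in enumerate(tokens):
        last = min(i + max_gap + 2, len(tokens))  # exclusive upper bound for j
        for j in range(i + 1, last):
            pairs.append((first, tokens[j]))
    return pairs


def smoking_bigrams(tokenized_records, max_gap=5, min_count=2, stem="smok"):
    """Keep pairs seen at least min_count times in which either word contains stem."""
    counts = Counter()
    for tokens in tokenized_records:
        counts.update(windowed_bigrams(tokens, max_gap))
    return {
        pair
        for pair, count in counts.items()
        if count >= min_count and any(stem in word for word in pair)
    }
```

With `max_gap=0` the function reduces to ordinary adjacent bigrams, which is why a pair such as "denies smoking" can be found even when a word like "any" sits between the two.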
14. November 10, 2006 I2B2 - Smoker Status Challenge 14
Unsupervised
Context Representations
• 2nd-order Context Representations
– Latent Semantic Analysis, native SenseClusters
– each record represented by a vector that is the
average of vectors that represent the individual
features:
• LSA
– each bigram is replaced by a vector showing the
records in which it occurs
• native SenseClusters
– each word is replaced by a vector showing the
second words it occurs with as a bigram
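The native SenseClusters variant described above can be illustrated with a toy example: each word gets a vector of counts over the second words it forms bigrams with, and a record is represented by the average of the vectors of its words. This is a simplified sketch under stated assumptions, not the SenseClusters code: the function names are hypothetical, words without a vector are simply skipped, and no dimensionality reduction is applied.

```python
def word_vectors(bigram_counts):
    """Map each first word to a count vector over the second words it pairs with."""
    columns = sorted({second for _, second in bigram_counts})
    col_index = {word: k for k, word in enumerate(columns)}
    vectors = {}
    for (first, second), count in bigram_counts.items():
        vec = vectors.setdefault(first, [0.0] * len(columns))
        vec[col_index[second]] += count
    return columns, vectors


def record_vector(tokens, vectors, dim):
    """Average the vectors of the record's words; words without a vector are skipped."""
    rows = [vectors[tok] for tok in tokens if tok in vectors]
    if not rows:
        return [0.0] * dim
    return [sum(col) / len(rows) for col in zip(*rows)]


# Hypothetical bigram counts, for illustration only.
bigram_counts = {
    ("quit", "smoking"): 3,
    ("denies", "smoking"): 2,
    ("denies", "alcohol"): 1,
}
columns, vectors = word_vectors(bigram_counts)
print(columns)
print(record_vector(["patient", "denies", "quit"], vectors, len(columns)))
```

The LSA variant differs only in what the per-feature vector contains: each bigram is looked up in a bigram-by-record matrix rather than each word in a word-by-word matrix, and the averaging step is the same.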
15. November 10, 2006 I2B2 - Smoker Status Challenge 15
Unsupervised Clustering
• Once vectors for all records are created, they
are clustered using a partitional method similar
to k-means
• The number of clusters is automatically
discovered using the PK2 measure, which
compares successive values of clustering
criterion function
• assign clusters to categories based on
distribution in training data
– unknown, non-smoker, past-smoker,
current-smoker, smoker
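The last step, assigning clusters to categories, could look roughly like the sketch below. It assumes a simple majority rule (each cluster takes the most frequent gold label among its training records); the slides do not spell out the exact assignment rule, so this is an illustration rather than the method actually used, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def map_clusters_to_labels(cluster_ids, gold_labels):
    """Assign each cluster the most frequent gold label among its members."""
    by_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, gold_labels):
        by_cluster[cluster_id][label] += 1
    # Assumption: majority vote per cluster; several clusters may share a label.
    return {cid: counts.most_common(1)[0][0] for cid, counts in by_cluster.items()}


def label_records(cluster_ids, mapping, default="UNKNOWN"):
    """Turn cluster memberships into category predictions."""
    return [mapping.get(cid, default) for cid in cluster_ids]
```

Under a majority rule, a category that never dominates any cluster is never predicted, which is one possible explanation for an all-zero column in a confusion matrix.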
16. November 10, 2006 I2B2 - Smoker Status Challenge 16
SenseClusters
69% accuracy (72/104)
evaluation data
a b c d e <-- classified as
63 0 0 0 0 | a = UNKNOWN
10 0 0 0 6 | b = NON-SMOKER
2 1 0 0 8 | c = PAST-SMOKER
1 0 0 0 2 | d = SMOKER
2 0 0 0 9 | e = CURR-SMOKER
17. November 10, 2006 I2B2 - Smoker Status Challenge 17
SenseClusters
79% accuracy (82/104)
evaluation data
a b f <-- classified as
63 0 0 | a = UNKNOWN
10 0 6 | b = NON-SMOKER
5 1 19 | f = ALL-SMOKER
18. November 10, 2006 I2B2 - Smoker Status Challenge 18
Latent Semantic Analysis
68% accuracy (71/104)
evaluation data
a b c d e <-- classified as
63 0 0 0 0 | a = UNKNOWN
10 0 0 0 6 | b = NON-SMOKER
1 3 0 0 7 | c = PAST-SMOKER
1 0 0 0 2 | d = SMOKER
2 1 0 0 8 | e = CURR-SMOKER
19. November 10, 2006 I2B2 - Smoker Status Challenge 19
Latent Semantic Analysis
77% accuracy (80/104)
evaluation data
a b f <-- classified as
63 0 0 | a = UNKNOWN
10 0 6 | b = NON-SMOKER
4 4 17 | f = ALL-SMOKER
20. November 10, 2006 I2B2 - Smoker Status Challenge 20
Conclusions
• Results dominated by UNKNOWN
– majority class sets a lower bound of 61% (63/104 evaluation records)
• Errors dominated by confusion in ALL-SMOKER
– reduction to 3 classes improves accuracy: decision tree 82% to 90%, SenseClusters 69% to 79%, LSA 68% to 77%
• Decision tree aided feature selection
• Manual tuning of feature sets performed since
records focus well beyond smoking status
• Unsupervised clustering perhaps found the "right" number of
clusters, and did well in that light
21. November 10, 2006 I2B2 - Smoker Status Challenge 21
Software Resources
• Supervised Experiments
– SenseTools (free, from Duluth)
http://www.d.umn.edu/~tpederse/sensetools.html
– Weka (free, from Waikato)
http://www.cs.waikato.ac.nz/ml/weka/
• Unsupervised Experiments
– SenseClusters (free, from Duluth)
http://senseclusters.sourceforge.net