November 10, 2006 I2B2 - Smoker Status Challenge 1
Determining Smoker Status using
Supervised and Unsupervised
Learning with Lexical Features
Ted Pedersen
University of Minnesota, Duluth
tpederse@d.umn.edu
http://www.d.umn.edu/~tpederse
Approaches
• Smoking Status as Text Classification
– supervised learning
– lexical features
– techniques used to good effect in word sense
disambiguation
• Smoking Status as Text Clustering
– unsupervised learning
– lexical features
– techniques used to good effect in word sense
discrimination
Objectives
• How well do WSD techniques generalize to
related but different problems?
– smoking status as "meaning" of record??
– not quite the same problem…
• How well do WSD features generalize?
– bag of words, unigrams
– bigrams
– collocations
• How well do learning algorithms generalize?
– supervised and unsupervised
Experimental Variations
Supervised Learning
• Learning Algorithm
– naïve Bayesian classifier
– J48 decision tree
– support vector machine (SMO)
• Feature Sets (also used in unsupervised)
– unigrams, bigrams, trigrams
– various frequency and measure-of-association cutoffs
– Stop List of 472 words
• 392 function words
• 80 words that occurred in more than half the records
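A minimal sketch of the unigram feature set described above, assuming a simple lowercase alphabetic tokenizer and binary (present/absent) feature values. The function names, the regular expression, and the toy records are illustrative assumptions; the experiments themselves used SenseTools and Weka.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase alphabetic strings only.
    return re.findall(r"[a-z]+", text.lower())


def unigram_features(records, stop_words, min_count=5):
    """Unigrams that are not stop words and occur at least min_count times in total."""
    counts = Counter()
    for text in records:
        counts.update(t for t in tokenize(text) if t not in stop_words)
    return sorted(w for w, c in counts.items() if c >= min_count)


def binary_vector(text, features):
    """1 if the feature word occurs in the record, else 0."""
    present = set(tokenize(text))
    return [1 if f in present else 0 for f in features]


# Toy records invented for illustration (not challenge data).
records = [
    "Patient quit smoking in 1990.",
    "Denies tobacco use.",
    "She quit smoking last year.",
]
stop_words = {"in", "the", "she"}
features = unigram_features(records, stop_words, min_count=2)
print(features)
print(binary_vector(records[0], features))
```

With the real data the cutoff would be 5 and the stop list would hold the 472 words mentioned above.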
Decision Tree
• J48 most accurate when using unigram
features that occurred 5 or more times in
the training data
– over 3,600 unigrams as candidate features
– decision tree has 47 nodes and 24 leaves
– accuracy of 82% (327/398)
Decision Tree
unigrams: 5 or more times (tree figure not reproduced)
82% accuracy (327/398)
10-fold cross validation on train
a b c d e <-- classified as
20 5 1 7 3 | a = PAST-SMOKER
8 46 3 8 1 | b = NON-SMOKER
8 2 240 2 0 | c = UNKNOWN
7 5 1 21 1 | d = CURR-SMOKER
1 3 1 4 0 | e = SMOKER
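The accuracy figure can be checked directly from the matrix. In Weka's output, rows are the actual class and columns the predicted class, so the diagonal holds the correct decisions. The sketch below uses the counts from this slide; the helper names are illustrative.

```python
labels = ["PAST-SMOKER", "NON-SMOKER", "UNKNOWN", "CURR-SMOKER", "SMOKER"]
matrix = [
    [20, 5, 1, 7, 3],
    [8, 46, 3, 8, 1],
    [8, 2, 240, 2, 0],
    [7, 5, 1, 21, 1],
    [1, 3, 1, 4, 0],
]


def accuracy(m):
    """Return (correct, total, correct / total) for a square confusion matrix."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct, total, correct / total


def recall_per_class(m, names):
    """Fraction of each actual class (row) that was classified correctly."""
    return {name: m[i][i] / sum(m[i]) for i, name in enumerate(names)}


correct, total, acc = accuracy(matrix)
print(f"{correct}/{total} = {acc:.1%}")
print(recall_per_class(matrix, labels))
```

The per-class view shows that none of the 9 SMOKER records is classified correctly, even though overall accuracy is 82%.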
Manual Inspection
• From the decision tree learned from the
3,600 features, we decided to use the
following in a second experiment:
– cigarette, drinks, quit, smoke, smoked,
smoker, smokes, smoking, tobacco
9-feature Decision Tree
selected from unigram tree
• quit = 0
• | smoking = 0
• | | smoker = 0
• | | | tobacco = 0
• | | | | smoke = 0
• | | | | | drinks = 0
• | | | | | | cigarette = 0
• | | | | | | | smoked = 0: UNKNOWN (253.0/3.0)
• | | | | | | | smoked = 1: PAST-SMOKER (2.0/1.0)
• | | | | | | cigarette = 1: NON-SMOKER (3.0/1.0)
• | | | | | drinks = 1: NON-SMOKER (6.0/3.0)
• | | | | smoke = 1: NON-SMOKER (16.0)
• | | | tobacco = 1
• | | | | smokes = 0: NON-SMOKER (39.0/7.0)
• | | | | smokes = 1: CURRENT-SMOKER (2.0)
• | | smoker = 1: CURRENT-SMOKER (11.0/5.0)
• | smoking = 1: CURRENT-SMOKER (42.0/22.0)
• quit = 1: PAST-SMOKER (24.0/4.0)
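The tree above reads as a chain of word-presence tests. A sketch of the same decisions as a Python function, assuming each record has already been reduced to a set of lowercase unigrams (the function name and input format are illustrative):

```python
def classify(record_words):
    """Apply the 9-feature tree to the set of lowercase unigrams in a record."""
    def has(word):
        return word in record_words

    if has("quit"):
        return "PAST-SMOKER"
    if has("smoking"):
        return "CURRENT-SMOKER"
    if has("smoker"):
        return "CURRENT-SMOKER"
    if has("tobacco"):
        return "CURRENT-SMOKER" if has("smokes") else "NON-SMOKER"
    if has("smoke"):
        return "NON-SMOKER"
    if has("drinks"):
        return "NON-SMOKER"
    if has("cigarette"):
        return "NON-SMOKER"
    if has("smoked"):
        return "PAST-SMOKER"
    return "UNKNOWN"


print(classify({"patient", "quit", "smoking"}))  # quit is tested first
print(classify({"denies", "tobacco"}))
print(classify({"chest", "pain"}))
```

Written this way it is easy to see that no leaf predicts SMOKER, so records with that label are always misclassified by this tree.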
9-feature Decision Tree
87% accuracy (345/398)
10-fold cross validation on train
a b c d e <-- classified as
20 5 1 10 0 | a = PAST-SMOKER
0 51 2 13 0 | b = NON-SMOKER
0 1 250 1 0 | c = UNKNOWN
5 4 2 24 0 | d = CURR-SMOKER
0 3 1 5 0 | e = SMOKER
9-feature Decision Tree
82% accuracy (85/104)
evaluation data
a b c d e <-- classified as
62 0 1 0 0 | a = UNKNOWN
1 10 1 0 4 | b = NON-SMOKER
0 2 4 0 5 | c = PAST-SMOKER
0 0 0 0 3 | d = SMOKER
0 1 1 0 9 | e = CURR-SMOKER
9-feature Decision Tree
90% accuracy (94/104)
evaluation data
a b f <-- classified as
62 0 1 | a = UNKNOWN
1 10 5 | b = NON-SMOKER
0 3 22 | f = ALL-SMOKER
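The 3-class matrix follows from the 5-class evaluation matrix by merging PAST-SMOKER, SMOKER and CURR-SMOKER into ALL-SMOKER in both rows and columns. A sketch of that merge, using the 5-class counts reported for the 9-feature tree on the evaluation data (the function name is illustrative):

```python
labels = ["UNKNOWN", "NON-SMOKER", "PAST-SMOKER", "SMOKER", "CURR-SMOKER"]
five_class = [
    [62, 0, 1, 0, 0],
    [1, 10, 1, 0, 4],
    [0, 2, 4, 0, 5],
    [0, 0, 0, 0, 3],
    [0, 1, 1, 0, 9],
]
merge = {
    "PAST-SMOKER": "ALL-SMOKER",
    "SMOKER": "ALL-SMOKER",
    "CURR-SMOKER": "ALL-SMOKER",
}


def collapse(matrix, names, mapping):
    """Sum rows and columns of a confusion matrix according to a label mapping."""
    new_names = []
    for name in names:
        target = mapping.get(name, name)
        if target not in new_names:
            new_names.append(target)
    index = {name: i for i, name in enumerate(new_names)}
    out = [[0] * len(new_names) for _ in new_names]
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            r = index[mapping.get(names[i], names[i])]
            c = index[mapping.get(names[j], names[j])]
            out[r][c] += count
    return new_names, out


new_labels, three_class = collapse(five_class, labels, merge)
correct = sum(three_class[i][i] for i in range(len(three_class)))
print(new_labels)
print(three_class, correct)
```

Confusions among the three smoker classes move onto the diagonal, which is where the gain from 85 to 94 correct comes from.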
Unsupervised Experiments
• Bigram Features
– allow up to 5 intervening words
– occur 2 or more times in training data
– limit to those that include "smok" --> 96 features
– social smoking, pack smoking, smoking alcohol,
smoking family, smoke drink, cigarette smoking,
allergies smoking, allergies smoked, smoking quit,
quit smoking, smoker drinks, former smoker, social
smoke, denies smoking, habits smoking, ...
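A rough sketch of the bigram selection described above: ordered word pairs with at most 5 words between them, kept if they occur at least twice and contain the string "smok". The tokenizer, function names, and toy records are assumptions for illustration; stop-word removal is omitted here.

```python
import re
from collections import Counter


def window_bigrams(tokens, max_gap=5):
    """Yield ordered pairs (first, second) with at most max_gap words in between."""
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + 2 + max_gap]:
            yield (first, second)


def smoking_bigrams(records, min_count=2, stem="smok", max_gap=5):
    """Bigrams seen at least min_count times in which either word contains stem."""
    counts = Counter()
    for text in records:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(window_bigrams(tokens, max_gap))
    return sorted(
        pair
        for pair, c in counts.items()
        if c >= min_count and any(stem in word for word in pair)
    )


# Toy records invented for illustration (not challenge data).
records = ["denies smoking or alcohol", "she denies any smoking"]
print(smoking_bigrams(records))
```

In the second toy record "denies" and "smoking" are separated by one word, so the pair is still counted and reaches the frequency cutoff of 2.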
Unsupervised
Context Representations
• 2nd order Context Representations
– Latent Semantic Analysis, native SenseClusters
– each record represented by a vector that is the
average of vectors that represent the individual
features:
• LSA
– each bigram is replaced by a vector showing the
records in which it occurs
• native SenseClusters
– each word is replaced by a vector showing the
second words it occurs with as a bigram
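A small sketch of the native SenseClusters style of second-order representation, assuming bigram counts are already available as a dictionary. Each first word gets a vector over the second words it occurs with, and a record is the average of the vectors of the words it contains. Names and toy counts are illustrative, not SenseClusters code.

```python
def word_vectors(bigram_counts):
    """Map each first word to a count vector over the second words it occurs with."""
    columns = sorted({second for _, second in bigram_counts})
    col = {word: j for j, word in enumerate(columns)}
    vectors = {}
    for (first, second), count in bigram_counts.items():
        vectors.setdefault(first, [0.0] * len(columns))[col[second]] += count
    return columns, vectors


def record_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that have one; zero vector if none do."""
    rows = [vectors[t] for t in tokens if t in vectors]
    if not rows:
        return [0.0] * dim
    return [sum(values) / len(rows) for values in zip(*rows)]


# Toy bigram counts invented for illustration.
bigram_counts = {
    ("quit", "smoking"): 3,
    ("denies", "smoking"): 2,
    ("denies", "tobacco"): 1,
}
columns, vectors = word_vectors(bigram_counts)
print(columns)
print(record_vector(["patient", "denies", "quit"], vectors, len(columns)))
```

For the LSA variant the same averaging would apply, with each bigram's vector instead recording the records in which it occurs.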
Unsupervised Clustering
• Once vectors for all records are created, they
are clustered using a partitional method similar
to k-means
• The number of clusters is automatically
discovered using the PK2 measure, which
compares successive values of clustering
criterion function
• assign clusters to categories based on
distribution in training data
– unknown, non-smoker, past-smoker,
current-smoker, smoker
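Two of the steps above can be sketched in a few lines. Both parts rest on assumptions: "compares successive values" is read here as a ratio of the criterion function at k clusters to its value at k-1 (the stopping rule itself is not given on the slide), and "based on distribution in training data" is read as giving each cluster the most frequent training label among its members. Function names and toy values are illustrative.

```python
from collections import Counter


def successive_ratios(criterion_values):
    """Ratio of each criterion value to the previous one (assumed reading of PK2)."""
    return [b / a for a, b in zip(criterion_values, criterion_values[1:])]


def label_clusters(cluster_ids, gold_labels):
    """Assign each cluster the most frequent gold label among its members."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, gold_labels):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in by_cluster.items()}


# Toy values invented for illustration.
print(successive_ratios([8.0, 12.0, 12.0]))  # a ratio near 1 means little gain from another cluster
clusters = [0, 0, 0, 1, 1, 1]
gold = ["UNKNOWN", "UNKNOWN", "NON-SMOKER", "CURR-SMOKER", "CURR-SMOKER", "PAST-SMOKER"]
print(label_clusters(clusters, gold))
```

Under this mapping, a category that is never the majority in any cluster is never predicted.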
SenseClusters
69% accuracy (72/104)
evaluation data
a b c d e <-- classified as
63 0 0 0 0 | a = UNKNOWN
10 0 0 0 6 | b = NON-SMOKER
2 1 0 0 8 | c = PAST-SMOKER
1 0 0 0 2 | d = SMOKER
2 0 0 0 9 | e = CURR-SMOKER
SenseClusters
79% accuracy (82/104)
evaluation data
a b f <-- classified as
63 0 0 | a = UNKNOWN
10 0 6 | b = NON-SMOKER
5 1 19 | f = ALL-SMOKER
Latent Semantic Analysis
68% accuracy (71/104)
evaluation data
a b c d e <-- classified as
63 0 0 0 0 | a = UNKNOWN
10 0 0 0 6 | b = NON-SMOKER
1 3 0 0 7 | c = PAST-SMOKER
1 0 0 0 2 | d = SMOKER
2 1 0 0 8 | e = CURR-SMOKER
Latent Semantic Analysis
77% accuracy (80/104)
evaluation data
a b f <-- classified as
63 0 0 | a = UNKNOWN
10 0 6 | b = NON-SMOKER
4 4 17 | f = ALL-SMOKER
Conclusions
• Results dominated by UNKNOWN
– sets lower bound of 61% (63 of 104 evaluation records are UNKNOWN)
• Errors dominated by confusion in ALL-SMOKER
– reduction to 3 classes improves results (decision tree 82% to 90%, SenseClusters 69% to 79%, LSA 68% to 77%)
• Decision tree aided feature selection
• Manual tuning of feature sets was performed since
records cover much more than smoking status
• Unsupervised clustering perhaps found the "right" number of
clusters, and did well in that light
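The figures behind these conclusions can be recomputed from the evaluation counts on the earlier slides. The sketch below only does that arithmetic; the variable names are illustrative.

```python
TOTAL = 104    # evaluation records
UNKNOWN = 63   # evaluation records whose true class is UNKNOWN

# (correct with 5 classes, correct with 3 classes), from the evaluation slides
correct_counts = {
    "9-feature decision tree": (85, 94),
    "SenseClusters": (72, 82),
    "Latent Semantic Analysis": (71, 80),
}


def percent(correct, total=TOTAL):
    return round(100 * correct / total)


baseline = percent(UNKNOWN)
summary = {
    name: (percent(five), percent(three))
    for name, (five, three) in correct_counts.items()
}
print(f"always-UNKNOWN baseline: {baseline}%")
for name, (five, three) in summary.items():
    print(f"{name}: {five}% -> {three}%")
```

Each method gains 8 to 10 points when the three smoker classes are merged, and all of them sit above the always-UNKNOWN baseline.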
Software Resources
• Supervised Experiments
– SenseTools (free, from Duluth)
http://www.d.umn.edu/~tpederse/sensetools.html
– Weka (free, from Waikato)
http://www.cs.waikato.ac.nz/ml/weka/
• Unsupervised Experiments
– SenseClusters (free, from Duluth)
http://senseclusters.sourceforge.net

Mais conteúdo relacionado

Destaque (8)

Ijcai 2007 Pedersen
Ijcai 2007 PedersenIjcai 2007 Pedersen
Ijcai 2007 Pedersen
 
Measuring Similarity Between Contexts and Concepts
Measuring Similarity Between Contexts and ConceptsMeasuring Similarity Between Contexts and Concepts
Measuring Similarity Between Contexts and Concepts
 
Amia06
Amia06Amia06
Amia06
 
The road from good software engineering to good science...is a two way street
The road from good software engineering to good science...is a two way streetThe road from good software engineering to good science...is a two way street
The road from good software engineering to good science...is a two way street
 
Presentation.Pit.2011 02 04.Lat.Dianak
Presentation.Pit.2011 02 04.Lat.DianakPresentation.Pit.2011 02 04.Lat.Dianak
Presentation.Pit.2011 02 04.Lat.Dianak
 
Advances In Wsd Aaai 2005
Advances In Wsd Aaai 2005Advances In Wsd Aaai 2005
Advances In Wsd Aaai 2005
 
Icon 2007 Pedersen
Icon 2007 PedersenIcon 2007 Pedersen
Icon 2007 Pedersen
 
A Gentle Introduction to the EM Algorithm
A Gentle Introduction to the EM AlgorithmA Gentle Introduction to the EM Algorithm
A Gentle Introduction to the EM Algorithm
 

Mais de University of Minnesota, Duluth

Muslims in Machine Learning workshop (NeurlPS 2021) - Automatically Identifyi...
Muslims in Machine Learning workshop (NeurlPS 2021) - Automatically Identifyi...Muslims in Machine Learning workshop (NeurlPS 2021) - Automatically Identifyi...
Muslims in Machine Learning workshop (NeurlPS 2021) - Automatically Identifyi...University of Minnesota, Duluth
 
Algorithmic Bias - What is it? Why should we care? What can we do about it?
Algorithmic Bias - What is it? Why should we care? What can we do about it? Algorithmic Bias - What is it? Why should we care? What can we do about it?
Algorithmic Bias - What is it? Why should we care? What can we do about it? University of Minnesota, Duluth
 
Algorithmic Bias : What is it? Why should we care? What can we do about it?
Algorithmic Bias : What is it? Why should we care? What can we do about it?Algorithmic Bias : What is it? Why should we care? What can we do about it?
Algorithmic Bias : What is it? Why should we care? What can we do about it?University of Minnesota, Duluth
 
Duluth at Semeval 2017 Task 6 - Language Models in Humor Detection
Duluth at Semeval 2017 Task 6 - Language Models in Humor Detection Duluth at Semeval 2017 Task 6 - Language Models in Humor Detection
Duluth at Semeval 2017 Task 6 - Language Models in Humor Detection University of Minnesota, Duluth
 
Who's to say what's funny? A computer using Language Models and Deep Learning...
Who's to say what's funny? A computer using Language Models and Deep Learning...Who's to say what's funny? A computer using Language Models and Deep Learning...
Who's to say what's funny? A computer using Language Models and Deep Learning...University of Minnesota, Duluth
 
Duluth at Semeval 2017 Task 7 - Puns upon a Midnight Dreary, Lexical Semantic...
Duluth at Semeval 2017 Task 7 - Puns upon a Midnight Dreary, Lexical Semantic...Duluth at Semeval 2017 Task 7 - Puns upon a Midnight Dreary, Lexical Semantic...
Duluth at Semeval 2017 Task 7 - Puns upon a Midnight Dreary, Lexical Semantic...University of Minnesota, Duluth
 
Puns upon a midnight dreary, lexical semantics for the weak and weary
Puns upon a midnight dreary, lexical semantics for the weak and wearyPuns upon a midnight dreary, lexical semantics for the weak and weary
Puns upon a midnight dreary, lexical semantics for the weak and wearyUniversity of Minnesota, Duluth
 
The horizon isn't found in a dictionary : Identifying emerging word senses a...
The horizon isn't found in a  dictionary : Identifying emerging word senses a...The horizon isn't found in a  dictionary : Identifying emerging word senses a...
The horizon isn't found in a dictionary : Identifying emerging word senses a...University of Minnesota, Duluth
 
Duluth : Word Sense Discrimination in the Service of Lexicography
Duluth : Word Sense Discrimination in the Service of LexicographyDuluth : Word Sense Discrimination in the Service of Lexicography
Duluth : Word Sense Discrimination in the Service of LexicographyUniversity of Minnesota, Duluth
 
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...University of Minnesota, Duluth
 
What it's like to do a Master's thesis with me (Ted Pedersen)
What it's like to do a Master's thesis with me (Ted Pedersen)What it's like to do a Master's thesis with me (Ted Pedersen)
What it's like to do a Master's thesis with me (Ted Pedersen)University of Minnesota, Duluth
 

Mais de University of Minnesota, Duluth (20)

Muslims in Machine Learning workshop (NeurlPS 2021) - Automatically Identifyi...
Muslims in Machine Learning workshop (NeurlPS 2021) - Automatically Identifyi...Muslims in Machine Learning workshop (NeurlPS 2021) - Automatically Identifyi...
Muslims in Machine Learning workshop (NeurlPS 2021) - Automatically Identifyi...
 
Automatically Identifying Islamophobia in Social Media
Automatically Identifying Islamophobia in Social MediaAutomatically Identifying Islamophobia in Social Media
Automatically Identifying Islamophobia in Social Media
 
What Makes Hate Speech : an interactive workshop
What Makes Hate Speech : an interactive workshopWhat Makes Hate Speech : an interactive workshop
What Makes Hate Speech : an interactive workshop
 
Algorithmic Bias - What is it? Why should we care? What can we do about it?
Algorithmic Bias - What is it? Why should we care? What can we do about it? Algorithmic Bias - What is it? Why should we care? What can we do about it?
Algorithmic Bias - What is it? Why should we care? What can we do about it?
 
Algorithmic Bias : What is it? Why should we care? What can we do about it?
Algorithmic Bias : What is it? Why should we care? What can we do about it?Algorithmic Bias : What is it? Why should we care? What can we do about it?
Algorithmic Bias : What is it? Why should we care? What can we do about it?
 
Duluth at Semeval 2017 Task 6 - Language Models in Humor Detection
Duluth at Semeval 2017 Task 6 - Language Models in Humor Detection Duluth at Semeval 2017 Task 6 - Language Models in Humor Detection
Duluth at Semeval 2017 Task 6 - Language Models in Humor Detection
 
Who's to say what's funny? A computer using Language Models and Deep Learning...
Who's to say what's funny? A computer using Language Models and Deep Learning...Who's to say what's funny? A computer using Language Models and Deep Learning...
Who's to say what's funny? A computer using Language Models and Deep Learning...
 
Duluth at Semeval 2017 Task 7 - Puns upon a Midnight Dreary, Lexical Semantic...
Duluth at Semeval 2017 Task 7 - Puns upon a Midnight Dreary, Lexical Semantic...Duluth at Semeval 2017 Task 7 - Puns upon a Midnight Dreary, Lexical Semantic...
Duluth at Semeval 2017 Task 7 - Puns upon a Midnight Dreary, Lexical Semantic...
 
Puns upon a midnight dreary, lexical semantics for the weak and weary
Puns upon a midnight dreary, lexical semantics for the weak and wearyPuns upon a midnight dreary, lexical semantics for the weak and weary
Puns upon a midnight dreary, lexical semantics for the weak and weary
 
The horizon isn't found in a dictionary : Identifying emerging word senses a...
The horizon isn't found in a  dictionary : Identifying emerging word senses a...The horizon isn't found in a  dictionary : Identifying emerging word senses a...
The horizon isn't found in a dictionary : Identifying emerging word senses a...
 
Screening Twitter Users for Depression and PTSD
Screening Twitter Users for Depression and PTSDScreening Twitter Users for Depression and PTSD
Screening Twitter Users for Depression and PTSD
 
Duluth : Word Sense Discrimination in the Service of Lexicography
Duluth : Word Sense Discrimination in the Service of LexicographyDuluth : Word Sense Discrimination in the Service of Lexicography
Duluth : Word Sense Discrimination in the Service of Lexicography
 
Pedersen masters-thesis-oct-10-2014
Pedersen masters-thesis-oct-10-2014Pedersen masters-thesis-oct-10-2014
Pedersen masters-thesis-oct-10-2014
 
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Conc...
 
What it's like to do a Master's thesis with me (Ted Pedersen)
What it's like to do a Master's thesis with me (Ted Pedersen)What it's like to do a Master's thesis with me (Ted Pedersen)
What it's like to do a Master's thesis with me (Ted Pedersen)
 
Pedersen naacl-2013-demo-poster-may25
Pedersen naacl-2013-demo-poster-may25Pedersen naacl-2013-demo-poster-may25
Pedersen naacl-2013-demo-poster-may25
 
Pedersen semeval-2013-poster-may24
Pedersen semeval-2013-poster-may24Pedersen semeval-2013-poster-may24
Pedersen semeval-2013-poster-may24
 
Talk at UAB, April 12, 2013
Talk at UAB, April 12, 2013Talk at UAB, April 12, 2013
Talk at UAB, April 12, 2013
 
Feb20 mayo-webinar-21feb2012
Feb20 mayo-webinar-21feb2012Feb20 mayo-webinar-21feb2012
Feb20 mayo-webinar-21feb2012
 
Ihi2012 semantic-similarity-tutorial-part1
Ihi2012 semantic-similarity-tutorial-part1Ihi2012 semantic-similarity-tutorial-part1
Ihi2012 semantic-similarity-tutorial-part1
 

Último

Call Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls Jaipur
Call Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls JaipurCall Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls Jaipur
Call Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls Jaipurparulsinha
 
Russian Call Girls in Jaipur Riya WhatsApp ❤8445551418 VIP Call Girls Jaipur
Russian Call Girls in Jaipur Riya WhatsApp ❤8445551418 VIP Call Girls JaipurRussian Call Girls in Jaipur Riya WhatsApp ❤8445551418 VIP Call Girls Jaipur
Russian Call Girls in Jaipur Riya WhatsApp ❤8445551418 VIP Call Girls Jaipurparulsinha
 
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...Taniya Sharma
 
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...Dipal Arora
 
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore EscortsCall Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escortsvidya singh
 
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...aartirawatdelhi
 
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...astropune
 
College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...
College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...
College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...perfect solution
 
VIP Call Girls Indore Kirti 💚😋 9256729539 🚀 Indore Escorts
VIP Call Girls Indore Kirti 💚😋  9256729539 🚀 Indore EscortsVIP Call Girls Indore Kirti 💚😋  9256729539 🚀 Indore Escorts
VIP Call Girls Indore Kirti 💚😋 9256729539 🚀 Indore Escortsaditipandeya
 
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Dipal Arora
 
Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...
Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...
Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...narwatsonia7
 
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...Call Girls in Nagpur High Profile
 
Call Girls Dehradun Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Dehradun Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Dehradun Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Dehradun Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort ServicePremium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Servicevidya singh
 
VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋
VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋
VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋TANUJA PANDEY
 
VIP Mumbai Call Girls Hiranandani Gardens Just Call 9920874524 with A/C Room ...
VIP Mumbai Call Girls Hiranandani Gardens Just Call 9920874524 with A/C Room ...VIP Mumbai Call Girls Hiranandani Gardens Just Call 9920874524 with A/C Room ...
VIP Mumbai Call Girls Hiranandani Gardens Just Call 9920874524 with A/C Room ...Garima Khatri
 
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...astropune
 
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...hotbabesbook
 

Último (20)

Call Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls Jaipur
Call Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls JaipurCall Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls Jaipur
Call Girls Service Jaipur Grishma WhatsApp ❤8445551418 VIP Call Girls Jaipur
 
Russian Call Girls in Jaipur Riya WhatsApp ❤8445551418 VIP Call Girls Jaipur
Russian Call Girls in Jaipur Riya WhatsApp ❤8445551418 VIP Call Girls JaipurRussian Call Girls in Jaipur Riya WhatsApp ❤8445551418 VIP Call Girls Jaipur
Russian Call Girls in Jaipur Riya WhatsApp ❤8445551418 VIP Call Girls Jaipur
 
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
 
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
Call Girls Bhubaneswar Just Call 9907093804 Top Class Call Girl Service Avail...
 
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore EscortsCall Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
 
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
 
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀️ night ...
 
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
 
College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...
College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...
College Call Girls in Haridwar 9667172968 Short 4000 Night 10000 Best call gi...
 
VIP Call Girls Indore Kirti 💚😋 9256729539 🚀 Indore Escorts
VIP Call Girls Indore Kirti 💚😋  9256729539 🚀 Indore EscortsVIP Call Girls Indore Kirti 💚😋  9256729539 🚀 Indore Escorts
VIP Call Girls Indore Kirti 💚😋 9256729539 🚀 Indore Escorts
 
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service Available
 
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
 
Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...
Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...
Top Rated Bangalore Call Girls Richmond Circle ⟟ 8250192130 ⟟ Call Me For Gen...
 
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
 
Call Girls Dehradun Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Dehradun Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Dehradun Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Dehradun Just Call 9907093804 Top Class Call Girl Service Available
 
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort ServicePremium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
 
VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋
VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋
VIP Hyderabad Call Girls Bahadurpally 7877925207 ₹5000 To 25K With AC Room 💚😋
 
VIP Mumbai Call Girls Hiranandani Gardens Just Call 9920874524 with A/C Room ...
VIP Mumbai Call Girls Hiranandani Gardens Just Call 9920874524 with A/C Room ...VIP Mumbai Call Girls Hiranandani Gardens Just Call 9920874524 with A/C Room ...
VIP Mumbai Call Girls Hiranandani Gardens Just Call 9920874524 with A/C Room ...
 
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...
♛VVIP Hyderabad Call Girls Chintalkunta🖕7001035870🖕Riya Kappor Top Call Girl ...
 
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...
Night 7k to 12k Chennai City Center Call Girls 👉👉 7427069034⭐⭐ 100% Genuine E...
 

I2 B2 2006 Pedersen

  • 1. November 10, 2006 I2B2 - Smoker Status Challenge 1 Determining Smoker Status using Supervised and Unsupervised Learning with Lexical Features Ted Pedersen University of Minnesota, Duluth tpederse@d.umn.edu http://www.d.umn.edu/~tpederse
  • 2. November 10, 2006 I2B2 - Smoker Status Challenge 2 Approaches • Smoking Status as Text Classification – supervised learning – lexical features – techniques used to good effect in word sense disambiguation • Smoking Status as Text Clustering – unsupervised learning – lexical features – techniques used to good effect in word sense discrimination
  • 3. November 10, 2006 I2B2 - Smoker Status Challenge 3 Objectives • How well do WSD techniques generalize to related but different problems? – smoking status as "meaning" of record?? – not quite the same problem… • How well do WSD features generalize? – bag of words, unigrams – bigrams – collocations • How well do learning algorithms generalize? – supervised and unsupervised
  • 4. November 10, 2006 I2B2 - Smoker Status Challenge 4 Experimental Variations Supervised Learning • Learning Algorithm – naïve Bayesian classifier – J48 decision tree – support vector machine (SMO) • Feature Sets (also used in unsupervised) – unigrams, bigrams, trigrams – various frequency and measure of association cutoffs – Stop List of 472 words • 392 function words • 80 words that occurred in more than half the records
  • 5. November 10, 2006 I2B2 - Smoker Status Challenge 5 Decision Tree • J48 most accurate when using unigram features that occurred 5 or more times in the training data – over 3,600 unigrams as candidate features – decision tree has 47 nodes and 24 leaves – accuracy of 82% (327/401)
  • 6. November 10, 2006 I2B2 - Smoker Status Challenge 6 Decision Tree unigrams : 5 or more times
  • 7. November 10, 2006 I2B2 - Smoker Status Challenge 7 82% accuracy (327/398) 10-fold cross validation on train a b c d e <-- classified as 20 5 1 7 3 | a = PAST-SMOKER 8 46 3 8 1 | b = NON-SMOKER 8 2 240 2 0 | c = UNKNOWN 7 5 1 21 1 | d = CURR- SMOKER 1 3 1 4 0 | e = SMOKER
  • 8. November 10, 2006 I2B2 - Smoker Status Challenge 8 Manual Inspection • From the decision tree learned from the 3,600 features, we decided to use the following in a second experiment: – cigarette, drinks, quit, smoke, smoked, smoker, smokes, smoking, tobacco
  • 9. November 10, 2006 I2B2 - Smoker Status Challenge 9 9-feature Decision Tree selected from unigram tree • quit = 0 • | smoking = 0 • | | smoker = 0 • | | | tobacco = 0 • | | | | smoke = 0 • | | | | | drinks = 0 • | | | | | | cigarette = 0 • | | | | | | | smoked = 0: UNKNOWN (253.0/3.0) • | | | | | | | smoked = 1: PAST-SMOKER (2.0/1.0) • | | | | | | cigarette = 1: NON-SMOKER (3.0/1.0) • | | | | | drinks = 1: NON-SMOKER (6.0/3.0) • | | | | smoke = 1: NON-SMOKER (16.0) • | | | tobacco = 1 • | | | | smokes = 0: NON-SMOKER (39.0/7.0) • | | | | smokes = 1: CURRENT-SMOKER (2.0) • | | smoker = 1: CURRENT-SMOKER (11.0/5.0) • | smoking = 1: CURRENT-SMOKER (42.0/22.0) • quit = 1: PAST-SMOKER (24.0/4.0)
  • 10. 9-feature Decision Tree
      87% accuracy (345/398), 10-fold cross validation on train

        a    b    c    d    e   <-- classified as
       20    5    1   10    0 | a = PAST-SMOKER
        0   51    2   13    0 | b = NON-SMOKER
        0    1  250    1    0 | c = UNKNOWN
        5    4    2   24    0 | d = CURR-SMOKER
        0    3    1    5    0 | e = SMOKER
  • 11. 9-feature Decision Tree
      82% accuracy (85/104), evaluation data

        a    b    c    d    e   <-- classified as
       62    0    1    0    0 | a = UNKNOWN
        1   10    1    0    4 | b = NON-SMOKER
        0    2    4    0    5 | c = PAST-SMOKER
        0    0    0    0    3 | d = SMOKER
        0    1    1    0    9 | e = CURR-SMOKER
  • 12. 9-feature Decision Tree
      90% accuracy (94/104), evaluation data

        a    b    f   <-- classified as
       62    0    1 | a = UNKNOWN
        1   10    5 | b = NON-SMOKER
        0    3   22 | f = ALL-SMOKER
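The 3-class counts are consistent with merging PAST-SMOKER, SMOKER, and CURR-SMOKER into ALL-SMOKER in the 5-class evaluation matrix. The slides do not say how the 3-class figures were produced, so the sketch below is only a check that collapsing rows and columns reproduces them; the helper name and the label mapping are assumptions.

```python
labels5 = ["UNKNOWN", "NON-SMOKER", "PAST-SMOKER", "SMOKER", "CURR-SMOKER"]
matrix5 = [
    [62, 0, 1, 0, 0],
    [1, 10, 1, 0, 4],
    [0, 2, 4, 0, 5],
    [0, 0, 0, 0, 3],
    [0, 1, 1, 0, 9],
]
labels3 = ["UNKNOWN", "NON-SMOKER", "ALL-SMOKER"]
merge = {name: "ALL-SMOKER" for name in ("PAST-SMOKER", "SMOKER", "CURR-SMOKER")}


def collapse(matrix, labels, merge, new_labels):
    """Sum cells of a confusion matrix after renaming some classes."""
    index = {name: i for i, name in enumerate(new_labels)}
    out = [[0] * len(new_labels) for _ in new_labels]
    for i, true_label in enumerate(labels):
        for j, pred_label in enumerate(labels):
            row = index[merge.get(true_label, true_label)]
            col = index[merge.get(pred_label, pred_label)]
            out[row][col] += matrix[i][j]
    return out


matrix3 = collapse(matrix5, labels5, merge, labels3)
print(matrix3)  # [[62, 0, 1], [1, 10, 5], [0, 3, 22]]
```

Confusions among the three smoker classes move onto the diagonal, which is why the correct count rises from 85 to 94.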
  • 13. Unsupervised Experiments
      • Bigram Features
          – allow up to 5 intervening words
          – occur 2 or more times in training data
          – limit to those that include "smok" --> 96 features
          – social smoking, pack smoking, smoking alcohol, smoking family, smoke drink, cigarette smoking, allergies smoking, allergies smoked, smoking quit, quit smoking, smoker drinks, former smoker, social smoke, denies smoking, habits smoking, ...
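A rough sketch of this kind of bigram selection follows. It is not the SenseClusters code: the function names, the tokenizer, and the decision to drop stop words before forming pairs are assumptions, and the toy records are invented.

```python
from collections import Counter
import re


def window_bigrams(tokens, max_gap=5):
    """Yield ordered word pairs with at most max_gap words between them."""
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + 2 + max_gap]:
            yield (first, second)


def smok_bigrams(records, stop_words=frozenset(), min_count=2, max_gap=5):
    """Keep frequent windowed bigrams in which either word contains 'smok'."""
    counts = Counter()
    for text in records:
        tokens = [t for t in re.findall(r"[a-z]+", text.lower())
                  if t not in stop_words]
        counts.update(window_bigrams(tokens, max_gap))
    return sorted(pair for pair, c in counts.items()
                  if c >= min_count and any("smok" in w for w in pair))


# Toy records invented for illustration only.
records = ["He quit smoking last year.",
           "She quit smoking in 2001.",
           "Denies alcohol use."]
selected = smok_bigrams(records, stop_words={"he", "she", "in"})
print(selected)  # [('quit', 'smoking')]
```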
  • 14. Unsupervised Context Representations
      • 2nd order Context Representations
          – Latent Semantic Analysis, native SenseClusters
          – each record represented by a vector that is the average of vectors that represent the individual features:
              • LSA: each bigram is replaced by a vector showing the records in which it occurs
              • native SenseClusters: each word is replaced by a vector showing the second words it occurs with as a bigram
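The native SenseClusters variant can be illustrated with a small sketch: each first word of a bigram gets a vector over the second words it pairs with, and a record's vector is the average of the vectors of its words. The function names and the binary cell values are assumptions; the actual tool may use counts or association scores instead.

```python
def word_vectors(bigrams):
    """Map each first word to a vector over the second words it pairs with."""
    columns = sorted({second for _, second in bigrams})
    col = {w: j for j, w in enumerate(columns)}
    vectors = {}
    for first, second in bigrams:
        # Binary cell values are an assumption made for this sketch.
        vectors.setdefault(first, [0.0] * len(columns))[col[second]] = 1.0
    return columns, vectors


def record_vector(tokens, vectors, size):
    """Average the vectors of the words in a record that have one."""
    rows = [vectors[t] for t in tokens if t in vectors]
    if not rows:
        return [0.0] * size
    return [sum(vals) / len(rows) for vals in zip(*rows)]


bigrams = [("quit", "smoking"), ("denies", "smoking"), ("former", "smoker")]
columns, vectors = word_vectors(bigrams)
vec = record_vector(["former", "smoker", "quit"], vectors, len(columns))
print(columns, vec)  # ['smoker', 'smoking'] [0.5, 0.5]
```

The LSA variant differs only in what the rows and columns are: each bigram is a row, and the columns are the records in which it occurs.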
  • 15. Unsupervised Clustering
      • Once vectors for all records are created, they are clustered using a partitional method similar to k-means
      • The number of clusters is automatically discovered using the PK2 measure, which compares successive values of the clustering criterion function
      • Clusters are assigned to categories based on distribution in training data
          – unknown, non-smoker, past-smoker, current-smoker, smoker
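One plausible reading of the last step is a majority vote: each cluster takes the category that is most frequent among the training records it contains. The slide does not spell out the exact rule, so the sketch below is an assumption, and the function name and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def label_clusters(cluster_ids, gold_labels):
    """Give each cluster the most frequent gold label among its training records."""
    by_cluster = defaultdict(Counter)
    for cid, label in zip(cluster_ids, gold_labels):
        by_cluster[cid][label] += 1
    return {cid: counts.most_common(1)[0][0]
            for cid, counts in by_cluster.items()}


# Toy cluster assignments and labels, invented for illustration.
cluster_ids = [0, 0, 0, 1, 1, 1]
gold_labels = ["UNKNOWN", "UNKNOWN", "NON-SMOKER",
               "CURR-SMOKER", "CURR-SMOKER", "PAST-SMOKER"]
mapping = label_clusters(cluster_ids, gold_labels)
print(mapping)  # {0: 'UNKNOWN', 1: 'CURR-SMOKER'}
```

Under a rule like this, finding fewer clusters than categories means some categories are never predicted.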
  • 16. SenseClusters
      69% accuracy (72/104), evaluation data

        a    b    c    d    e   <-- classified as
       63    0    0    0    0 | a = UNKNOWN
       10    0    0    0    6 | b = NON-SMOKER
        2    1    0    0    8 | c = PAST-SMOKER
        1    0    0    0    2 | d = SMOKER
        2    0    0    0    9 | e = CURR-SMOKER
  • 17. SenseClusters
      79% accuracy (82/104), evaluation data

        a    b    f   <-- classified as
       63    0    0 | a = UNKNOWN
       10    0    6 | b = NON-SMOKER
        5    1   19 | f = ALL-SMOKER
  • 18. Latent Semantic Analysis
      68% accuracy (71/104), evaluation data

        a    b    c    d    e   <-- classified as
       63    0    0    0    0 | a = UNKNOWN
       10    0    0    0    6 | b = NON-SMOKER
        1    3    0    0    7 | c = PAST-SMOKER
        1    0    0    0    2 | d = SMOKER
        2    1    0    0    8 | e = CURR-SMOKER
  • 19. Latent Semantic Analysis
      77% accuracy (80/104), evaluation data

        a    b    f   <-- classified as
       63    0    0 | a = UNKNOWN
       10    0    6 | b = NON-SMOKER
        4    4   17 | f = ALL-SMOKER
  • 20. Conclusions
      • Results dominated by UNKNOWN
          – sets lower bound of 61%
      • Errors dominated by confusion in ALL-SMOKER
          – reduction to 3 classes improves results on evaluation data (decision tree: 82% to 90%; SenseClusters: 69% to 79%; LSA: 68% to 77%)
      • Decision tree aided feature selection
      • Manual tuning of feature sets performed since records focus well beyond smoking status
      • Unsupervised clustering perhaps found the "right" number of clusters, and did well in that light
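The 61% lower bound appears to correspond to always guessing the majority class on the evaluation data. A short check, using the row totals of the evaluation confusion matrices as class sizes (the variable names are hypothetical):

```python
# Evaluation-set class sizes, taken from the row totals of the
# 5-class evaluation confusion matrices.
class_sizes = {"UNKNOWN": 63, "NON-SMOKER": 16, "PAST-SMOKER": 11,
               "SMOKER": 3, "CURR-SMOKER": 11}

majority = max(class_sizes, key=class_sizes.get)
baseline = class_sizes[majority] / sum(class_sizes.values())
print(majority, round(100 * baseline))  # UNKNOWN 61
```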
  • 21. Software Resources
      • Supervised Experiments
          – SenseTools (free, from Duluth) http://www.d.umn.edu/~tpederse/sensetools.html
          – Weka (free, from Waikato) http://www.cs.waikato.ac.nz/ml/weka/
      • Unsupervised Experiments
          – SenseClusters (free, from Duluth) http://senseclusters.sourceforge.net