Text Classification and Clustering
with
WEKA
A guided example by
Sergio Jiménez
The Task
Building a model that classifies English-language movie reviews
as positive or negative.
Sentiment Polarity Dataset Version 2.0
1000 positive and 1000 negative movie review texts from:
Thumbs up? Sentiment Classification using Machine Learning
Techniques. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.
Proceedings of EMNLP, pp. 79--86, 2002.
“Our data source was the Internet Movie Database (IMDb) archive of
the rec.arts.movies.reviews newsgroup. We selected only reviews
where the author rating was expressed either with stars or some
numerical value (other conventions varied too widely to allow for
automatic processing). Ratings were automatically extracted and
converted into one of three categories: positive, negative, or
neutral. For the work described in this paper, we concentrated only
on discriminating between positive and negative sentiment.”
http://www.cs.cornell.edu/people/pabo/movie-review-data/
The Data (1/2)
The Data (2/2)
[Figure: two histograms of review length, one for the 1000 negative reviews and one for the 1000 positive reviews; x-axis: # characters, y-axis: # documents]
What is WEKA?
• “Weka is a collection of machine learning
algorithms for data mining tasks”.
• “Weka contains tools for:
– data pre-processing,
– classification,
– regression,
– clustering,
– association rules,
– and visualization”
Where to start?
Getting WEKA
Before Running WEKA
Increasing available memory for Java in RunWeka.ini
Change
maxheap=256m
to
maxheap=1024m
Running WEKA
using
“RunWeka.bat”
Creating a .arff dataset
Saving the .arff dataset
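As a rough illustration of what such a file contains, the sketch below writes labelled review texts to an ARFF dataset with one string attribute and one nominal class attribute. The helper names `quote_arff` and `write_arff`, the attribute name `text`, the relation name, and the labels `pos` and `neg` are assumptions made for this example; the slides build the file through the WEKA interface instead.

```python
import io

def quote_arff(text):
    """Quote a string for ARFF: escape backslashes and single quotes, flatten whitespace."""
    cleaned = text.replace("\\", "\\\\").replace("'", "\\'")
    cleaned = " ".join(cleaned.split())  # collapse newlines and runs of whitespace
    return "'" + cleaned + "'"

def write_arff(reviews, out, relation="movie_reviews"):
    """Write (text, label) pairs as an ARFF dataset to a file-like object."""
    labels = sorted({label for _, label in reviews})
    out.write("@relation " + relation + "\n\n")
    out.write("@attribute text string\n")
    out.write("@attribute class {" + ",".join(labels) + "}\n\n")
    out.write("@data\n")
    for text, label in reviews:
        out.write(quote_arff(text) + "," + label + "\n")

# Hypothetical toy reviews; the real dataset has 2000 of them.
reviews = [("great movie", "pos"), ("it's the worst\nfilm ever", "neg")]
buffer = io.StringIO()
write_arff(reviews, buffer)
arff_text = buffer.getvalue()
print(arff_text)
```

Writing to a file-like object keeps the sketch easy to try in memory; passing an opened file instead would produce a file on disk.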
From text to vectors
$V = [v_1, v_2, v_3, \ldots, v_n, \mathit{class}]$
review1=“great movie”
review2=“excellent film”
review3=“worst film ever”
review4=“sucks”
Vocabulary (one vector position per word, in this order): ever, excellent, film, great, movie, sucks, worst
$V_1 = [0, 0, 0, 1, 1, 0, 0, +]$
$V_2 = [0, 1, 1, 0, 0, 0, 0, +]$
$V_3 = [1, 0, 1, 0, 0, 0, 1, -]$
$V_4 = [0, 0, 0, 0, 0, 1, 0, -]$
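The mapping from the four example reviews to presence vectors can be sketched in a few lines of Python. This is only an illustration of the idea, not of how WEKA implements it; the function name `to_vector` is made up for the example.

```python
reviews = [
    ("great movie", "+"),
    ("excellent film", "+"),
    ("worst film ever", "-"),
    ("sucks", "-"),
]

# Vocabulary: every distinct word, sorted alphabetically.
vocabulary = sorted({word for text, _ in reviews for word in text.split()})

def to_vector(text, label):
    """Return one 0/1 entry per vocabulary word, followed by the class label."""
    words = set(text.split())
    return [1 if word in words else 0 for word in vocabulary] + [label]

vectors = [to_vector(text, label) for text, label in reviews]
for vector in vectors:
    print(vector)
```

Each vector has one position per vocabulary word plus a final position for the class, so every review gets the same length regardless of how many words it contains.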
Converting to Vector Space Model
Edit “movie_reviews.arff” and rename the “class” attribute to
“class1” (presumably to avoid a name clash with the word “class”
occurring in the review texts). Then apply the filter again.
Visualize the vector data
StringToWordVector filter options
Lowercase conversion
TF-IDF weighting
Stopword removal using a list of words in a file
Stemming
Use word frequencies instead of simple presence
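To make the difference between these options concrete, here is a minimal Python sketch of lowercasing, stopword removal, and the three weighting schemes. The function names and the TF-IDF formula used (count times log of N over document frequency) are assumptions chosen for illustration; WEKA's filter may compute its weights differently, and stemming is left out because it needs an external stemmer.

```python
import math
from collections import Counter

def tokenize(text, lowercase=True, stopwords=frozenset()):
    """Split on whitespace, optionally lowercase, and drop stopwords."""
    if lowercase:
        text = text.lower()
    return [w for w in text.split() if w not in stopwords]

def weight_documents(docs, mode="presence", **tokenizer_options):
    """Return one {word: weight} dict per document.

    mode: "presence" (0/1), "frequency" (raw counts), or "tfidf"
    (count * log(N / document frequency), one common variant).
    """
    counts = [Counter(tokenize(d, **tokenizer_options)) for d in docs]
    if mode == "presence":
        return [{w: 1 for w in c} for c in counts]
    if mode == "frequency":
        return [dict(c) for c in counts]
    if mode == "tfidf":
        n_docs = len(docs)
        doc_freq = Counter(w for c in counts for w in c)
        return [{w: tf * math.log(n_docs / doc_freq[w]) for w, tf in c.items()}
                for c in counts]
    raise ValueError("unknown mode: " + mode)

# Hypothetical toy documents and stopword list.
docs = ["The film is great great fun", "the film is dull"]
stop = frozenset({"the", "is"})
presence = weight_documents(docs, "presence", stopwords=stop)
frequency = weight_documents(docs, "frequency", stopwords=stop)
tfidf = weight_documents(docs, "tfidf", stopwords=stop)
print(presence[0], frequency[0], tfidf[0], sep="\n")
```

In this toy example "film" occurs in both documents, so its TF-IDF weight is zero, while "great" keeps a positive weight because it occurs in only one.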
Generating datasets for experiments
| dataset file name | Stopwords | Stemming | Presence or freq. |
| movie_reviews_1.arff | kept | no | presence |
| movie_reviews_2.arff | kept | no | frequency |
| movie_reviews_3.arff | kept | yes | presence |
| movie_reviews_4.arff | kept | yes | frequency |
| movie_reviews_5.arff | removed | no | presence |
| movie_reviews_6.arff | removed | no | frequency |
| movie_reviews_7.arff | removed | yes | presence |
| movie_reviews_8.arff | removed | yes | frequency |
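The eight datasets are the full cross product of three binary choices. A small Python sketch can enumerate them in the same order as the file numbering; the label "kept" for the rows where stopwords are not removed and the dictionary keys are naming assumptions of this example.

```python
from itertools import product

stopword_options = ["kept", "removed"]
stemming_options = ["no", "yes"]
weight_options = ["presence", "frequency"]

configurations = []
# product() varies the last option fastest, which matches the file numbering.
for index, (stopwords, stemming, weight) in enumerate(
        product(stopword_options, stemming_options, weight_options), start=1):
    configurations.append({
        "file": "movie_reviews_%d.arff" % index,
        "stopwords": stopwords,
        "stemming": stemming,
        "weight": weight,
    })

for config in configurations:
    print(config)
```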
Classifying Reviews
Click!
Select a classifier
Select class attribute
Select number of folds
Start!
Results
Results: Correctly Classified Reviews
| dataset name | Stopwords | Stemming | Presence or freq. | Naive Bayes, 3-fold | NaiveBayes Multinomial, 3-fold |
| movie_reviews_1.arff | kept | no | presence | 80.65% | 83.80% |
| movie_reviews_2.arff | kept | no | frequency | 69.30% | 78.65% |
| movie_reviews_3.arff | kept | yes | presence | 79.40% | 82.15% |
| movie_reviews_4.arff | kept | yes | frequency | 68.10% | 79.70% |
| movie_reviews_5.arff | removed | no | presence | 81.80% | 84.35% |
| movie_reviews_6.arff | removed | no | frequency | 69.40% | 81.75% |
| movie_reviews_7.arff | removed | yes | presence | 78.90% | 82.40% |
| movie_reviews_8.arff | removed | yes | frequency | 68.30% | 80.50% |
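For readers who want to see what a multinomial Naive Bayes run with 3-fold cross-validation involves, the following is a minimal Python sketch on toy data. It is not WEKA's implementation: the Laplace smoothing, the interleaved (non-random, non-stratified) fold assignment, and all function names are assumptions made to keep the example short.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels):
    """Estimate log priors and Laplace-smoothed log word likelihoods."""
    vocabulary = {w for d in docs for w in d}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n in class_counts.items():
        total = sum(word_counts[label].values()) + len(vocabulary)
        log_likelihood = {w: math.log((word_counts[label][w] + 1) / total)
                          for w in vocabulary}
        model[label] = (math.log(n / len(labels)), log_likelihood)
    return model

def predict(model, doc):
    """Return the class with the highest log score; unseen words are ignored."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(log_likelihood[w] for w in doc if w in log_likelihood)
    return max(model, key=score)

def cross_validate(docs, labels, folds=3):
    """Return the fraction of correctly classified documents over k folds."""
    correct = 0
    for fold in range(folds):
        test_idx = set(range(fold, len(docs), folds))  # every k-th document
        train_docs = [d for i, d in enumerate(docs) if i not in test_idx]
        train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        model = train_multinomial_nb(train_docs, train_labels)
        correct += sum(predict(model, docs[i]) == labels[i] for i in test_idx)
    return correct / len(docs)

# Hypothetical toy reviews, already tokenized.
docs = [["great", "film"], ["awful", "film"], ["great", "acting"],
        ["awful", "plot"], ["great", "plot"], ["awful", "acting"]]
labels = ["+", "-", "+", "-", "+", "-"]
accuracy = cross_validate(docs, labels, folds=3)
print("Correctly classified: %.2f%%" % (100 * accuracy))
```

Each document is used for testing exactly once, so the reported percentage is computed over the whole dataset, as in the table above.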
Attribute (Word) Selection
Choose an Attribute
Selection Algorithm
Select the
class attribute
Selected Attributes (words)
also
awful
bad
boring
both
dull
fails
pointless
poor
ridiculous
script
seagal
sometimes
stupid
deserves
effective
flaws
greatest
hilarious
memorable
overall
great
joke
lame
life
many
maybe
mess
nothing
others
perfect
performances
stupid
tale
terrible
true
visual
waste
wasted
world
worst
animation
definitely
overall
perfectly
realistic
share
solid
subtle
terrific
unlike
view
wonderfully
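The slides do not say which attribute selection algorithm produced this list. As one plausible illustration, the Python sketch below ranks words by information gain with respect to the class; the choice of information gain, the function names, and the toy data are all assumptions of this example.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / total
        result -= p * math.log2(p)
    return result

def information_gain(documents, labels, word):
    """Entropy reduction obtained by splitting documents on presence of `word`."""
    with_word = [l for d, l in zip(documents, labels) if word in d]
    without_word = [l for d, l in zip(documents, labels) if word not in d]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_word, without_word) if part)
    return entropy(labels) - remainder

# Hypothetical toy documents, each represented as a set of words.
documents = [{"great", "film"}, {"awful", "film"}, {"great", "plot"}, {"awful", "plot"}]
labels = ["+", "-", "+", "-"]
vocabulary = sorted(set().union(*documents))
ranking = sorted(vocabulary,
                 key=lambda w: information_gain(documents, labels, w),
                 reverse=True)
print(ranking)
```

Words such as "great" and "awful" separate the classes perfectly in the toy data and rank first, while "film" and "plot" occur equally in both classes and carry no information.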
Pruned movie_reviews_1.arff dataset
Naïve Bayes with the pruned dataset
Clustering
Correctly clustered instances: 65.25%
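A percentage of correctly clustered instances is typically obtained by mapping each cluster to a class and counting the matches. The Python sketch below shows one way to do this, trying every one-to-one assignment of clusters to classes and keeping the best. The slides do not state the clusterer or the evaluation mode, so this procedure, the function name, and the toy data are assumptions; the sketch also assumes there are no more clusters than classes.

```python
from collections import Counter
from itertools import permutations

def clustering_accuracy(cluster_ids, class_labels):
    """Best fraction of instances whose cluster maps to their class,
    over all one-to-one assignments of clusters to classes."""
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(class_labels))
    pair_counts = Counter(zip(cluster_ids, class_labels))
    best = 0
    for assigned in permutations(classes, len(clusters)):
        matched = sum(pair_counts[(c, k)] for c, k in zip(clusters, assigned))
        best = max(best, matched)
    return best / len(class_labels)

# Hypothetical cluster assignments and true classes for eight reviews.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 0]
class_labels = ["neg", "neg", "pos", "pos", "pos", "pos", "neg", "neg"]
accuracy = clustering_accuracy(cluster_ids, class_labels)
print("Correctly clustered instances: %.2f%%" % (100 * accuracy))
```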
Other results
Results of Pang et al. (2002) on version 1.0 of the dataset (700 positive and 700 negative reviews)
Thanks
