Semi-supervised learning for sentiment analysis on organized crime
Introduction
Blogs have become an integral part of the Internet user experience: users visit them
and engage by expressing opinions on topics (blog posts). The study of the
subjectivity of these opinions is referred to as sentiment analysis and has been
applied to various domains (e.g., retail, politics, journalism). In this study we
focus on sentiment analysis of blogs relating to organized crime; in particular, we
study the situation Mexico is experiencing with drug trafficking. We work with
data from blogdelnarco.com, a controversial blog that attracts as many as three
million hits per week. Individuals expressing their opinions on blog posts include
civilians, police and criminals. All of them interact on the blog by providing very
strong opinions, yet there is also a high degree of noise from comments that fail to
express sentiment towards an entity (e.g., army, police, government agencies).
We used a semi-supervised approach, a Transductive Support Vector Machine (TSVM)
model, for sentiment classification. A TSVM can be thought of as a Support Vector
Machine (SVM) that is first trained on some labeled data; unlabeled data is then
used to retrain the original model, and the process is iterated until the unlabeled
data has had an influence on the model.
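For reference, the optimization problem usually associated with a TSVM can be written as below. This is the standard soft-margin formulation rather than something stated on the poster, and the notation is assumed here: $(x_i, y_i)$ are the $n$ labeled instances, $x_j^*$ are the $k$ unlabeled instances with labels $y_j^*$ to be inferred, and $C$, $C^*$ are the penalties on labeled and unlabeled slack.

```latex
\min_{y_1^*,\dots,y_k^*,\; w,\; b,\; \xi,\; \xi^*}\quad
\frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{k} \xi_j^*
\quad\text{s.t.}\quad
y_i (w \cdot x_i + b) \ge 1 - \xi_i,\;\;
y_j^* (w \cdot x_j^* + b) \ge 1 - \xi_j^*,\;\;
\xi_i \ge 0,\;\; \xi_j^* \ge 0,\;\; y_j^* \in \{-1, +1\}
```

Because the labels $y_j^*$ are themselves optimization variables, the problem is combinatorial, which is consistent with the much longer running time reported for the TSVM.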
jaj
jaja
jajaa
jajaaa
jajaaj
jajaaja
jajaajaj
jajaajaja
jajaajajaj
jajaj
jajaja
jajajaa
jajajaaa
jajajaaj
jajajaaja
jajajaajaj
jajajaajajaj
jajajaajajaja
jajajaajja
jajajaj
jajajaja
jajajajaa
jajajajaaa
jajajajaaaa
jajajajaaj
jajajajaaja
jajajajaajaj
jajajajaajaja
jajajajaj
jajajajaja
jajajajajaa
jajajajajaaj
jajajajajaaja
jajajajajaj
jajajajajaja
jajajajajajaa
jajajajajajaaj
jajajajajajaaja
jajajajajajaj
jajajajajajaja
jajajajajajajaa
jajajajajajajaj
jajajajajajajaja
jajajajajajajajaj
jajajajajajajajaja
jajajajajajajajajaj
jajajajajajajajajaja
jajajajajajajajajajaj
jajajajajajajajajajaja
jajajajajajajajajajajaj
jajajajajajajajajajajaja
jajajajajajajajajajajajaj
jajajajajajajajajajajajaja
jajajajajajajajajajajajajaj
jajajajajajajajajajajajajaja
jajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajajajajajajajajajajajajajajajajaja
jajajajajajajajja
jajajajajajajajjaja
jajajajajajajja
jajajajajajja
jajajajajajjaja
jajajajajajjajaja
jajajajajja
jajajajajjaa
jajajajajjaja
jajajajajjajajaja
jajajajj
jajajajja
jajajajjaa
jajajajjaaj
jajajajjaj
jajajajjaja
jajajajjajaa
jajajajjajaj
jajajajjajaja
jajajajjajajaja
jajajajjajajajaja
jajajj
jajajja
jajajjaa
jajajjaaj
jajajjaaja
jajajjaj
jajajjaja
jajajjajaa
jajajjajaaj
jajajjajaj
jajajjajaja
jajajjajajaj
jajajjajajaja
jajajjajajajaja
jajajjja
jajja
jajjaa
jajjaaj
jajjaj
jajjaja
jajjajaa
jajjajaj
jajjajaja
jajjajajaj
jajjajajaja
jajjajajajaj
jajjajja
jajjja
jal
Methodology
1. Gathered data for eight months of activity using Blogger API and iMacro plug-in
for Firefox for text extraction.
2. Cleaned the text by removing diacritical marks and case folding tokens. Then
removed stop words (list of 249) and split comments that had paragraphs into
separate comments for cohesiveness.
3. Filtered out comments with entities not being tracked. Initial focus was on the
President of Mexico (Felipe Calderon) for which we used an ontology of aliases
that refer to him.
4. Labeled 420 instances with the help of humans as either positive or negative,
discarding any neutral.
5. Created and trained model using SVM Light Java package.
6. Classified unlabeled instances using the model created in the previous step. We
then validated with 10-fold cross-validation, comparing the accuracy of an SVM
model and a TSVM model.
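Steps 2 and 3 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the poster: the stop-word set and the alias set are hypothetical placeholders (the poster's list has 249 stop words and its alias ontology is not reproduced here), and paragraphs are assumed to be separated by blank lines.

```python
import re
import unicodedata
from typing import List

# Hypothetical, tiny stop-word list; the poster's actual list has 249 entries.
STOP_WORDS = {"a", "de", "el", "la", "los", "que", "y"}

# Hypothetical aliases for the tracked entity (assumption, not the poster's ontology).
ALIASES = {"calderon", "presidente", "felipe"}


def strip_diacritics(text: str) -> str:
    """Remove diacritical marks, e.g. 'Calderón' becomes 'Calderon'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def split_paragraphs(comment: str) -> List[str]:
    """Split a comment on blank lines so each paragraph becomes its own comment."""
    parts = re.split(r"\n\s*\n", comment)
    return [p.strip() for p in parts if p.strip()]


def tokenize(text: str) -> List[str]:
    """Strip diacritics, case-fold, and drop stop words."""
    cleaned = strip_diacritics(text).casefold()
    tokens = re.findall(r"[a-z0-9]+", cleaned)
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(comment: str) -> List[List[str]]:
    """Return token lists only for paragraphs that mention a tracked alias."""
    kept = []
    for paragraph in split_paragraphs(comment):
        tokens = tokenize(paragraph)
        if any(t in ALIASES for t in tokens):
            kept.append(tokens)
    return kept
```

Matching aliases after diacritic removal and case folding means a single alias entry covers spellings such as "Calderón", "CALDERON" and "calderon".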
Hypothesis: The use of a TSVM for sentiment classification will produce higher
accuracy than an SVM.
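The 10-fold cross-validation mentioned in step 6 can be sketched as below. The code is only an illustration of the evaluation procedure: `train_majority` is a hypothetical stand-in classifier, and in the actual study the trained models were an SVM and a TSVM built with SVM Light.

```python
import random
from typing import List, Tuple


def k_fold_splits(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle indices 0..n-1 and return k (train, test) index splits."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        # The remaining k-1 folds form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def cross_validate(features, labels, train_fn, k: int = 10, seed: int = 0) -> float:
    """Mean accuracy over k folds; train_fn(X, y) must return a predict(x) callable."""
    accuracies = []
    for train, test in k_fold_splits(len(labels), k, seed):
        predict = train_fn([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(features[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


def train_majority(X, y):
    """Hypothetical baseline: always predict the most common training label."""
    majority = max(set(y), key=y.count)
    return lambda x: majority
```

Passing two different `train_fn` callables with the same `seed` evaluates both models on identical folds, which keeps the accuracy comparison fair.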
Conclusion
We introduced the problem of sentiment analysis on blogs in the organized crime
domain and proposed a TSVM as an appropriate tool for classifying the sentiment of
comments given the particular characteristics of the corpus. The results of our work
clearly show the advantage of a semi-supervised learning approach by means of a
TSVM. One may argue, however, that training TSVMs is time consuming: classifying
the 5000+ instances took approximately three hours on a dual-core machine, whereas
an SVM took less than one minute to classify the same number of instances.
Despite the time tradeoff, accuracy seems promising, especially after identifying
sources of improvement. We have made the dataset we
studied publicly available and invite you to play with it:
http://cs.utexas.edu/~gcabrera/data.zip
Guillermo Cabrera 1 and Jason Baldridge 2
Problem
• How does one define sentiment analysis in the context of organized crime?
• How to deal with complex use of language: profanity, colloquialisms, new
vocabulary, poor grammar and capitalization. Multiple aliases that refer to the
same entity also need to be taken into account.
• Why would this be helpful to government institutions?
Sample comments showing a few of the attributes mentioned above. An individual
with knowledge of Spanish language and culture in Mexico might find it trivial to
assign a positive label to (1) and a negative to (2).
State of the Art
Efforts in sentiment analysis mainly take one of two forms. The first is a
knowledge-based approach, in which a dictionary determines the polarity of words and
is then matched against the dataset. The second makes use of machine learning
techniques, in which a classifier is fed labeled instances as training data.
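The knowledge-based approach can be illustrated with a small sketch. The lexicon below is hypothetical and far smaller than any real polarity dictionary, and summing word polarities is only one simple way to match a dictionary against a comment.

```python
# Hypothetical polarity lexicon (word -> +1 or -1); an assumption for illustration.
LEXICON = {
    "bueno": 1,
    "valiente": 1,
    "gracias": 1,
    "robar": -1,
    "matar": -1,
    "extorsionar": -1,
}


def lexicon_polarity(tokens):
    """Sum word polarities and map the score to a label."""
    score = sum(LEXICON.get(t, 0) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

A comment whose words are all missing from the dictionary scores zero, which hints at why this approach struggles with the slang and misspellings found in this corpus.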
• Melville et al. (2009) take a supervised learning approach: they construct a
generative model from a polarity-annotated lexicon and then build a model trained
on labeled documents.
• Chen and Lin (2010) argue for the importance of the class imbalance problem,
stating that blogs can contain far fewer negative instances than positive ones.
• Other work, such as Bautin et al. (2008), uses machine translation to convert text
into English and then performs sentiment analysis in that language. They mention
that although the translation process has some negative effect, it was not a
significant issue in their experiments.
Results
For the SVM, accuracy improves as we include more labeled instances to train our
model. The TSVM, on the other hand, shows a significant improvement in accuracy
even though we never used 100% of the training instances to train the initial TSVM.
We can also use far fewer labeled instances (compare the 73.3% at 60% for the TSVM
to the 67.7% at 100% for the SVM) and still get better results using a TSVM.
Future Work
We identified errors in our approach, such as including derogatory aliases in our
ontology for the President and taking too coarse a granularity by breaking on
paragraphs instead of sentences.
• Subjectivity classification before polarity classification.
• Plan to use the OpenNLP Sentence Detector.
• Stemming even in the presence of poor orthography.
• Dimensionality reduction by means of minimal shadows (Pandey and Iyer, 2009).
(1) A la bio a la bao a la bim bom ba, los federales, ra ra ra / hooray for
the federales.
(2) La ****** pfp solo viene a robar, matar y extorsionar, fuera / the
******* pfp only comes to steal, kill and extort, out.
Table 1: Accuracy comparison of TSVM (Highlighted values plotted in Figure 2).
Figure 1: Experiments to test accuracy for SVMs (left) and TSVMs (right).
Figure 2: Accuracy comparison: SVM and TSVM
Figure 3: Sentiment over time for President
agarrar
(grab/catch)
agarado
agaran
agarando
agarar
agararon
agaren
garrence
agarra
agarraban
agarraditos
agarrado
agarrados
agarralos
agarrame
agarramos
agarran
agarrando
agarrandose
agarrara
agarraran
agarrarlo
agarrarlos
agarrarnos
agarraron
agarrarse
agarrarte
agarras
agarraste
agarrate
agarre
agarremos
agarren
agarrenlos
agarrense
agarres
agarro
agarron
Figure 4: Use of stemming to reduce dimensions
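The idea of collapsing misspelled variants onto one stem can be sketched as below. This is a deliberately crude illustration, not a real Spanish stemmer and not the method used for the figure: the suffix list is hypothetical, and collapsing doubled letters also alters legitimate spellings (for example words containing "ll").

```python
import re

# Hypothetical suffix list, tried longest first; an assumption for illustration.
SUFFIXES = sorted(
    ["andose", "aron", "ando", "aban", "amos", "arse", "arlo", "ado", "ar", "an", "a", "o"],
    key=len,
    reverse=True,
)


def crude_stem(token: str) -> str:
    """Collapse repeated letters, then strip at most one known suffix."""
    collapsed = re.sub(r"(.)\1+", r"\1", token.lower())
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not emptied.
        if collapsed.endswith(suffix) and len(collapsed) - len(suffix) >= 3:
            return collapsed[: -len(suffix)]
    return collapsed


variants = ["agarrar", "agarado", "agaran", "agarrando", "agararon", "agarra", "agarro"]
stems = {crude_stem(v) for v in variants}
```

Here the seven variants map to a single stem, so seven feature dimensions would shrink to one; forms outside the suffix list (such as "garrence") would still need separate handling.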
[Figure 1 diagram labels: Shuffled (both panels); 420 labeled instances; subsets of 105, 210, 315, 420; folds k-1 and 1; percentage included: 100% (left), 20%, 40%, 60%, 80% (right)]
Labeled Instances   Percentage Included
                    20%    40%    60%    80%
105                 70.7   70.7   73.3   72.0
210                 69.5   74.3   75.5   76.3
315                 67.5   71.3   73.9   77.2
420                 73.4   71.3   76.6   78.8
[Figure 2 axes: x = # Labeled Instances (105, 210, 315, 420); y = Accuracy (%), ticks 0.62 to 0.8; series: SVM, TSVM]
[Figure 3 axes: x = month (May to October); y = # of Messages, 0 to 2500; series: Positive, Negative]
1 Department of Computer Science, 1 University Station C0500; 2 Department of Linguistics, 1 University Station B5100. Austin, TX 78712
