2. Content
1. Background
2. Formulating the problem
3. Data Mining Process
4. Techniques
5. Analysis
3. What is Data Mining?
• Extraction of meaningful / useful / interesting
patterns from a large volume of data sources
• In this project, the source is a large
volume of WEB HOTEL REVIEWS data
• Data mining is one of the top ten emerging
technologies
MIT’s TECHNOLOGY REVIEW 2004
4. What is Data Mining?
• Process of exploration and analysis
• By automatic / semi automatic means
• With little or no human interactions
• To discover meaningful patterns and rules
MASTERING DATA MINING BY BERRY AND LINOFF, 2000
5. Users’ Opinions on Hotels
• Increase in social media and web
users
• Increase in valuable opinion-
oriented data on hotels due to web
expansion
• Identify a potential hotel to stay at by
looking at its aspects
• Overall sentiments on hotels are
greatly sought on the web for
Sentiment Analysis
6. What can Data Mining do?
• Identify best prospects
(ASPECTS), and retain customers
• Predict what ASPECTS
customers like and promote
accordingly
• Learn parameters influencing
trends in sales and margins
• Identification of opinions for
customers
Sentiment Analysis !!!
7. What are the problems?
• Exponential growth of users’
opinions
• Limitations of human analysis
• Accuracy of human analysis
Machines can be trained to take
over human analysis with advanced
computer technology, and this is done
at LOW COST
8. Some Limitations of machines
• Unable to read like a human
• No emotions
• Cannot detect sarcasm
• Expression of sentiments in
different topics and domains
• Polarity analysis
• Facts Vs Opinion
9. Some machine limitation examples
• “The service is as good as none”.
Negation is not obvious to a machine.
• “Swimming pool is big enough to
swim with comfort” , “There is a
big crowd at the counter
complaining”. Polarity might
change with context.
• “The room is warmer than the
lobby”. Comparisons are hard to
classify
11. Machine Learning
• A tool for data mining and intelligent decision
support
• Application of computer algorithms that
improve automatically through experience
MASTERING DATA MINING BY BERRY AND LINOFF, 2000
12. Types of Machine learning
• Supervised Learning
• A training set is provided (data
with correct answers) which is
used to mine for known patterns
• Unsupervised Learning
• Data are provided with no prior
knowledge of the hidden
patterns that they contain.
• Semi Supervised Learning
14. Project Objective
• Prediction of sentence polarity
• Classification of polarity for sentiment
lexicon
• Detection of relations
15. Pre-requisite
• Large data set
• Relevant prior knowledge of the
domain, in our case the hotel
domain
• E.g. rating
• Sentiment lexicon for sentiment
analysis
• Data selection for reliability and
standards
17. Cleaning the “Dirty” Data (60% of effort)
• Frequent problem: Data inconsistencies
• Duplicate data
• Spelling Errors != Trim from data
• Foreign accent and characters
• Singular / Plural conversion
• Punctuation removal / replacement
• Noise and incomplete data
• Naming convention misused, same name but
different meaning
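A few of the cleaning steps listed above (duplicates, foreign accents, punctuation) can be sketched as follows. This is only a minimal illustration, not the pipeline used in the project: the function names `clean_review` and `deduplicate` are hypothetical, and spelling correction and singular/plural conversion are left out.

```python
import re
import unicodedata

def clean_review(text):
    """Normalize one review: strip accents, lowercase, drop punctuation."""
    # Decompose accented characters and drop the combining marks ("é" -> "e").
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a word character or whitespace with a space.
    no_punct = re.sub(r"[^\w\s]", " ", no_accents.lower())
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", no_punct).strip()

def deduplicate(reviews):
    """Keep the first occurrence of each cleaned review, preserving order."""
    seen = set()
    unique = []
    for review in reviews:
        cleaned = clean_review(review)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique
```

Comparing reviews after normalization means that two copies differing only in case or punctuation count as duplicates.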
19. Findings
• Part of Speech (POS) Tagging using Brill
Tagger: NO PROBLEM
- 95% accuracy when POS tagging words after data
cleaning
20. Findings
• Polarity tagging using sentiment lexicon –
BIG PROBLEM
- 40% of sentiment words are not found in the sentiment
lexicon
- 10% of sentiment words with a positive or
negative polarity are found in the neutral section
of the sentiment lexicon
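The kind of coverage figures reported above can be obtained with a simple lexicon lookup. The sketch below is illustrative only: it assumes the lexicon is a plain dictionary mapping each word to `"positive"`, `"negative"`, or `"neutral"`, and the name `lookup_coverage` is hypothetical.

```python
from collections import Counter

def lookup_coverage(sentiment_words, lexicon):
    """Return the share of words per lexicon label, plus a 'missing' share.

    `lexicon` is assumed to map word -> 'positive' | 'negative' | 'neutral'.
    """
    tally = Counter()
    for word in sentiment_words:
        # Words absent from the lexicon are counted under "missing".
        tally[lexicon.get(word.lower(), "missing")] += 1
    total = sum(tally.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in tally.items()}
```

A high `"missing"` or `"neutral"` share is the signal that the lexicon needs expanding for the domain.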
21. Problems
• Sentiment lexicon is not comprehensive enough for
the machine learning technique adopted
• Sentiment words whose polarity is domain
dependent are found in the neutral section of the
sentiment lexicon
• Polarity of sentiment words can also change
within the domain even though they are
domain dependent
EXPANSION OF LEXICON !!!
22. Solution
• Classify the polarity of unlabeled sentiment
words using rule based mining
• Classify domain dependent sentiment words
• Establish word relations between labeled and
unlabeled sentiment words
23. Data Processing
• Rule based mining using conjunction and
punctuation
Polarity Assignment Rules
Polarity   Pattern
Same       Adj – AND/OR – Adj
Opposite   Neg-Adj – AND/OR – Adj / Adj – AND/OR – Neg-Adj
Same       Neg-Adj – AND/OR – Neg-Adj
Opposite   Adj – BUT/NOR – Adj
Same       Neg-Adj – BUT/NOR – Adj / Adj – BUT/NOR – Neg-Adj
Opposite   Neg-Adj – BUT/NOR – Neg-Adj
Same       Adj , Adj
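The rules above follow one pattern: AND, OR, and a comma link adjectives of the same polarity, BUT and NOR link opposite polarities, and each negation flips the relation once. A minimal sketch of that logic, with hypothetical names (`relation`, `infer_polarity`) and polarity encoded as +1 / -1 by assumption:

```python
SAME_CONNECTIVES = {"and", "or", ","}
OPPOSITE_CONNECTIVES = {"but", "nor"}

def relation(negated_left, connective, negated_right):
    """Return 'same' or 'opposite' for two adjectives joined by a connective."""
    connective = connective.lower()
    if connective in SAME_CONNECTIVES:
        same = True
    elif connective in OPPOSITE_CONNECTIVES:
        same = False
    else:
        raise ValueError(f"unsupported connective: {connective!r}")
    # Each negated adjective flips the relation once.
    if negated_left:
        same = not same
    if negated_right:
        same = not same
    return "same" if same else "opposite"

def infer_polarity(known_polarity, negated_known, connective, negated_unknown):
    """Propagate a known polarity (+1 or -1) to the unlabeled adjective."""
    rel = relation(negated_known, connective, negated_unknown)
    return known_polarity if rel == "same" else -known_polarity
```

For example, if "clean" is known to be positive, then in "clean and spacious" the call `infer_polarity(+1, False, "and", False)` assigns "spacious" a positive polarity as well.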
26. Analysis
• Using the expanded sentiment lexicon, we
analyze sentiment polarity by doing a
sentiment lookup using a Bayesian network
27. Bayesian
• To determine polarity of sentiments
P(X | Y) = P(X) P(Y | X) / P(Y)
• Probability that a sentiment is positive or
negative, given its contents
• Assumption: There is no link between words
• P(sentiment | sentence) =
P(sentiment)P(sentence | sentiment) /
P(sentence)
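With the assumption that words are independent of each other, the formula above reduces to what is usually called a naive Bayes classifier. The sketch below is one possible reading, not the project's actual implementation: the class name `NaiveBayesSentiment` is hypothetical, add-one (Laplace) smoothing is an added assumption so unseen words do not zero out a class, and P(sentence) is dropped because it is the same for every class.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSentiment:
    """Word-independence (naive Bayes) classifier with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # sentences seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    def train(self, labelled_sentences):
        for sentence, label in labelled_sentences:
            words = sentence.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocabulary.update(words)

    def log_scores(self, sentence):
        """log P(sentiment) + sum of log P(word | sentiment) for each label."""
        words = sentence.lower().split()
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # prior P(sentiment)
            denom = sum(self.word_counts[label].values()) + vocab_size
            for word in words:
                # Smoothed likelihood P(word | sentiment).
                score += math.log((self.word_counts[label][word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, sentence):
        scores = self.log_scores(sentence)
        return max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.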
28. Validation
• Precision = N (agree & found) / N (found)
• High precision means most of the found sentiment
words are correctly labeled by the system
• Recall = N (agree & found) / N (agree)
• High recall means most of the correct sentiment
words are found by the system
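The two measures can be computed directly from sets, as in the hedged sketch below. It assumes `found` holds the (word, label) pairs output by the system and `agree` holds the pairs the human judges accept as correct; the function name and the data representation are illustrative.

```python
def precision_recall(found, agree):
    """Compute precision and recall from two collections of (word, label) pairs."""
    found, agree = set(found), set(agree)
    both = found & agree  # N(agree & found)
    precision = len(both) / len(found) if found else 0.0
    recall = len(both) / len(agree) if agree else 0.0
    return precision, recall
```

Returning 0.0 for an empty set is a convention chosen here to avoid division by zero.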
29. Validation Results
• Out of the 350 aspect-
unlabelled sentiment word pairs,
• only 194 are found by the methods.
Thus, the precision is about 57%.
• The recall is also not very high; only 126
words are correctly labelled by the
system, which is about 63%.
30. Discussion
• The results will improve if more rules are
applied, such as the inclusion of more adverbs
like “excessively” as negation words.
• The dataset might not be large enough for the
system to work on. There are only 350 aspect-
unlabelled sentiment word pairs for the
application to work with.
• This, however, requires more human judges to
validate the data
31. Conclusion
• A comprehensive sentiment lexicon is a
simple yet effective solution to sentiment
analysis as it does not require prior training
• Current sentiment lexicons do not capture
the domain and context sensitivities of
sentiment expressions
32. Conclusion
• This leads to poor coverage
• Thus, expanding the general sentiment lexicon to
capture domain and context sensitivities of
sentiment expressions is advocated
What data mining can do in the hotel domain: in other words, learn the market
Impossible for humans to read every single opinion. Humans are biased toward reading certain opinions. Machines: allow fast access to vast amounts of data; allow computationally intensive algorithms and statistical methods.
There are many fields of data mining; in this project we will focus on these 4
Growing data volume, limitations of humans, and low cost compared with human analysis
The goal for unsupervised learning is to discover these patterns. Semi-supervised: knowledge is known and applied from one data collection in order to mine, classify, analyze, and interpret a related data collection.
Some of the problems to be solved by data mining: prediction of sentence polarity; classification of polarity for sentiment lexicon; detection of relations.
Data inconsistencies: e.g. the title says good but the review says bad
Assigning a label to every word in the text to allow the machine to do something with it
POS tagging goes wrong for some words, such as “heart”, that can take two tags
For example, in the domain of handheld devices, the word “large” can express positivity for screen size but negativity in the phone size.
After establishing relations, we have a graph of nodes (sentiments / aspects). Determine the probability that a node is positive or negative given its surrounding nodes. Start with a high-frequency unlabelled sentiment word-aspect pair, then, based on the aspect and its labelled sentiment pair, determine the polarity of the unlabelled word. This process iterates until all unlabelled words have been assigned a polarity.
A comprehensive sentiment lexicon can provide a simple yet effective solution to sentiment analysis, because it is general and does not require prior training. Therefore, attention and effort have been paid to the construction of such lexicons. However, a significant challenge to this approach is that the polarity of many words is domain and context dependent. For example, ‘long’ is positive in ‘long battery life’ and negative in ‘long shutter lag.’ Current sentiment lexicons do not capture such domain and context sensitivities of sentiment expressions. They either exclude such domain and context dependent sentiment expressions or tag them with an overall polarity tendency based on statistics gathered from certain corpus such as the world wide web accessed via the internet. While excluding such expressions leads to poor coverage, simply tagging them with a polarity tendency leads to poor precision.