1. + Jan Žižka
František Dařena
Department of Informatics
Faculty of Business and Economics
Mendel University in Brno
Czech Republic
MINING SIGNIFICANT WORDS FROM
CUSTOMER OPINIONS WRITTEN IN
DIFFERENT NATURAL LANGUAGES
2. +
Introduction
Many companies collect opinions expressed
by their customers.
These opinions can hide valuable knowledge.
Discovering this knowledge manually can be
a very demanding task because
the opinion database can be very large,
the customers can use different languages,
people can handle the opinions subjectively,
and additional resources (like lists of positive
and negative words) might be needed.
3. +
Objective
To answer the question “What is
significant for including a certain
opinion in a category such as
satisfied or dissatisfied customers?”,
automatically extract the words significant
for positive and negative customer
opinions and form reasonably small
dictionaries of these words.
4. +
Data description
The processed data consisted of hotel guest reviews
collected from publicly available sources.
The reviews were labeled as positive or
negative.
Review characteristics:
more than 5,000,000 reviews,
written in more than 25 natural languages,
written only by real customers, based on a real
experience,
written relatively carefully but still containing errors that
are typical for natural languages.
5. +
Review examples
Positive
The breakfast and the very clean rooms stood out as the best
features of this hotel.
Clean and moden, the great loation near station. Friendly
reception!
The rooms are new. The breakfast is also great. We had a really
nice stay.
Good location - very quiet and good breakfast.
Negative
High price charged for internet access which actual cost now
is extreamly low.
water in the shower did not flow away
The room was noisy and the room temperature was higher
than normal.
The air conditioning wasn't working
6. +
Data preparation
Data collection, cleaning (removing tags and
non-letter characters), converting to upper case.
Transforming into the bag-of-words
representation, with term frequencies (TF) used as
attribute values.
Removing the words with global frequency < 2.
Stemming, stopword removal, spell
checking, diacritics removal, etc. were not
carried out.
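The preparation steps above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names, the regular expressions, the whitespace tokenization (which would not suit Japanese), and the reading of "global frequency" as the total number of occurrences in the whole collection are all assumptions.

```python
import re
from collections import Counter

# Assumed patterns: markup tags, and anything that is not a letter or whitespace.
TAG_RE = re.compile(r"<[^>]+>")
NON_LETTER_RE = re.compile(r"[^\w\s]|[\d_]")

def clean(text):
    """Remove tags and non-letter characters, then convert to upper case."""
    text = TAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return text.upper()

def bag_of_words(reviews, min_global_freq=2):
    """Return one term-frequency Counter per review.

    Words whose total count over all reviews is below min_global_freq
    are dropped. No stemming, stopword removal, or spell checking is done.
    """
    tokenized = [clean(review).split() for review in reviews]
    global_freq = Counter(word for tokens in tokenized for word in tokens)
    vocabulary = {w for w, n in global_freq.items() if n >= min_global_freq}
    return [Counter(w for w in tokens if w in vocabulary) for tokens in tokenized]

if __name__ == "__main__":
    sample = ["<b>Good</b> location, good breakfast!", "Bad location."]
    for vector in bag_of_words(sample):
        print(dict(vector))
```

Note that removing non-letter characters splits a token such as "wasn't" into two tokens, which is a side effect of this simple cleaning rule.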
7. +
Data characteristics
[Figure: bar chart of the number of positive and negative reviews per language (English, French, Spanish, German, Italian, Russian, Japanese, Czech); vertical axis from 0 to 1,200,000.]
8. +
Data characteristics
[Figure: bar chart of the number of unique words per language (English, German, Japanese, French, Spanish, Italian, Russian, Czech) for MinTF=1 and MinTF=2; vertical axis from 0 to 250,000.]
9. +
Finding the significant words
Thanks to a large collection of labeled
examples, a classifier that separates positive and
negative reviews could be created.
To reveal significant attributes (words), a decision
tree was built using the tree-generating algorithm
c5 (by R. Quinlan), based on entropy minimization.
The goal was not to achieve the best classification
accuracy but to find relevant attributes that
contribute to assigning a text to a given class.
The significant words appeared in the nodes of the
decision tree.
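To illustrate what entropy minimization means when choosing a word for a tree node, here is a minimal sketch that picks the word whose presence/absence test gives the largest entropy reduction (information gain). It is not the c5 implementation: c5 applies further refinements, the function names are hypothetical, and each review is assumed to be a dictionary mapping words to term frequencies.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(docs, labels, word):
    """Entropy reduction obtained by the test 'TF of word > 0'."""
    present = [y for d, y in zip(docs, labels) if d.get(word, 0) > 0]
    absent = [y for d, y in zip(docs, labels) if d.get(word, 0) == 0]
    n = len(labels)
    remainder = (len(present) / n) * entropy(present) \
              + (len(absent) / n) * entropy(absent)
    return entropy(labels) - remainder

def best_split_word(docs, labels):
    """Word with the highest information gain (ties broken alphabetically)."""
    vocabulary = sorted({w for d in docs for w in d})
    return max(vocabulary, key=lambda w: information_gain(docs, labels, w))

if __name__ == "__main__":
    docs = [{"CLEAN": 1, "ROOM": 1}, {"CLEAN": 2},
            {"NOISY": 1, "ROOM": 1}, {"NOISY": 1}]
    labels = ["pos", "pos", "neg", "neg"]
    print(best_split_word(docs, labels))
```

A full tree would apply the same selection recursively to the reviews in each branch; the words chosen along the way are the ones that end up in the nodes.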
11. +
Finding the significant words
The classification accuracy, which is proportional to
the relevancy of words, was between 83% and 93%.
The decision tree mostly asked whether the frequency
was > 0 or = 0 (binary representation).
The decision tree provides a list of about 200–300
words significant for classification from the
sentiment perspective, together with the
significance of each word (i.e. how frequently the word
is used during classification).
Only 15 words for each language are presented,
together with their significance (column %).
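One possible reading of "frequency of using the words during classification" is the percentage of reviews whose classification passes through a node testing the given word. The sketch below assumes that reading. The tree is a small hand-written, hypothetical example (not output of c5), and the node layout and function names are illustrative only.

```python
from collections import Counter

# Hypothetical tree: each internal node tests "TF of word > 0".
TREE = {
    "word": "CLEAN",
    "present": {"label": "positive"},
    "absent": {
        "word": "NOISY",
        "present": {"label": "negative"},
        "absent": {"label": "positive"},
    },
}

def classify(tree, doc, usage):
    """Classify one review and count every word tested on the way down."""
    node = tree
    while "label" not in node:
        usage[node["word"]] += 1
        branch = "present" if doc.get(node["word"], 0) > 0 else "absent"
        node = node[branch]
    return node["label"]

def word_usage_percent(tree, docs):
    """Percentage of reviews for which each word was tested."""
    if not docs:
        return {}
    usage = Counter()
    for doc in docs:
        classify(tree, doc, usage)
    return {word: 100.0 * n / len(docs) for word, n in usage.items()}

if __name__ == "__main__":
    reviews = [{"CLEAN": 1}, {"NOISY": 1}, {"CLEAN": 1, "NOISY": 1}, {"ROOM": 1}]
    print(word_usage_percent(TREE, reviews))
```

Under this reading, words near the root are tested for most reviews and therefore receive the highest percentages.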
12. +
Handling large collections
For languages with a large number of reviews, the
datasets were randomly split into subsets
of 50,000 reviews because of memory
requirements, and a decision tree was created for
each subset.
Each of the 50,000-sample subsets gave almost the
same list of words.
The relevancies of the extracted words were averaged.
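The splitting and averaging described above might look as follows. This is a sketch under assumptions: the function names are hypothetical, the per-subset relevancies are taken to be dictionaries mapping words to percentages, and a word missing from one subset's list is treated as having relevancy 0 in that subset (the slides do not say how such cases were handled).

```python
import random
from collections import defaultdict

def split_into_subsets(items, subset_size, seed=0):
    """Shuffle the items and cut them into consecutive chunks of subset_size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + subset_size]
            for i in range(0, len(shuffled), subset_size)]

def average_relevancies(per_subset):
    """Average word relevancies over all subsets (missing word counts as 0)."""
    totals = defaultdict(float)
    for scores in per_subset:
        for word, value in scores.items():
            totals[word] += value
    return {word: total / len(per_subset) for word, total in totals.items()}

if __name__ == "__main__":
    subsets = split_into_subsets(range(10), subset_size=4)
    print([len(s) for s in subsets])
    print(average_relevancies([{"CLEAN": 100.0, "NOISY": 50.0},
                               {"CLEAN": 90.0}]))
```

In the setting of the slides, `subset_size` would be 50,000 and one tree would be built per subset before averaging.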
17. +
Conclusions
A procedure that applies machine learning and
natural language processing to
automatically find significant words was presented.
From the total number of words (80,000–200,000) only
about 200–300 were identified as significant.
The simple, unified procedure worked well for many
languages.
Follow-up research focuses on determining the
strength of sentiment and on generating typical short
phrases instead of only individual words.
The procedure might be used in marketing
research or marketing intelligence, for filtering
reviews, generating keyword lists, etc.
18. Thank you for your attention
Vielen Dank für Ihre Aufmerksamkeit
Gracias por vuestra atención
Merci de votre attention
Grazie per la vostra attenzione
Спасибо за ваше внимание
ご静聴ありがとうございました
Děkuji za vaši pozornost