    MINING SIGNIFICANT WORDS FROM
    CUSTOMER OPINIONS WRITTEN IN
    DIFFERENT NATURAL LANGUAGES

    Jan Žižka, František Dařena
    Department of Informatics, Faculty of Business and Economics
    Mendel University in Brno, Czech Republic
+
    Introduction


     Many companies collect opinions expressed
      by their customers.
     These opinions can hide valuable knowledge.
     Extracting this knowledge manually can be
      very demanding because
      the opinion database can be very large,
      the customers can use different languages,
      people can interpret the opinions subjectively,
      additional resources (such as lists of positive
       and negative words) are sometimes needed.
+
    Objective


    To answer the question “What is
     significant for assigning a certain
     opinion to a category such as
     satisfied or dissatisfied customers?”,
     automatically extract the words that are
     significant for positive and negative
     customer opinions and form reasonably
     small dictionaries of these words.
+
    Data description

     The processed data consisted of hotel guest reviews
      collected from publicly available sources.
     The reviews were labeled as positive or
      negative.
     Review characteristics:
      more than 5,000,000 reviews,
      written in more than 25 natural languages,
      written only by real customers, based on real
       experience,
      written relatively carefully but still containing errors
       typical of natural-language text.
+
    Review examples

       Positive
           The breakfast and the very clean rooms stood out as the best
            features of this hotel.
           Clean and moden, the great loation near station. Friendly
            reception!
           The rooms are new. The breakfast is also great. We had a really
            nice stay.
           Good location - very quiet and good breakfast.

       Negative
           High price charged for internet access which actual cost now
            is extreamly low.
           water in the shower did not flow away
           The room was noisy and the room temperature was higher
            than normal.
           The air conditioning wasn't working
+
    Data preparation


     Data collection, cleaning (removing tags and
      non-letter characters), conversion to upper case.
     Transformation into the bag-of-words
      representation, with term frequencies (TF) used as
      attribute values.
     Removal of words with global frequency < 2.
     Stemming, stop-word removal, spell
      checking, diacritics removal, etc. were not
      carried out.
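
The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `tokenize` and `bag_of_words` are hypothetical, "global frequency" is assumed to mean the total number of occurrences across the whole collection, and tags are assumed to be simple `<...>` markup.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")          # assumed tag format: <...>
LETTERS_RE = re.compile(r"[^\W\d_]+")    # runs of Unicode letters only


def tokenize(review):
    """Remove tags, keep only letter runs, convert to upper case."""
    text = TAG_RE.sub(" ", review)
    return [word.upper() for word in LETTERS_RE.findall(text)]


def bag_of_words(reviews, min_global_freq=2):
    """Return (vocabulary, list of term-frequency Counters).

    Words whose total count over all reviews is below min_global_freq
    are dropped. No stemming, stop-word removal, or spell checking.
    """
    token_lists = [tokenize(r) for r in reviews]
    global_freq = Counter(t for tokens in token_lists for t in tokens)
    vocab = sorted(w for w, c in global_freq.items() if c >= min_global_freq)
    keep = set(vocab)
    vectors = [Counter(t for t in tokens if t in keep) for tokens in token_lists]
    return vocab, vectors


reviews = [
    "<b>Good</b> location - very quiet and good breakfast.",
    "The breakfast is also great.",
]
vocab, vectors = bag_of_words(reviews)
print(vocab)        # words that occur at least twice in the collection
print(vectors[0])   # term frequencies for the first review
```

Because only letter runs are kept, a token such as "wasn't" is split into two words; whether the original pipeline did the same is not stated in the slides.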
+
    Data characteristics

    [Figure: number of reviews (positive and negative) per language, on a scale of 0 to 1,200,000. Languages: English, French, Spanish, German, Italian, Russian, Japanese, Czech.]
+
    Data characteristics

    [Figure: number of unique words per language for MinTF=1 and MinTF=2, on a scale of 0 to 250,000. Languages: English, German, Japanese, French, Spanish, Italian, Russian, Czech.]
+
    Finding the significant words

     Thanks to the large collection of labeled
      examples, a classifier that separates positive and
      negative reviews could be created.
     To reveal the significant attributes (words), a decision
      tree was built using the tree-generating algorithm
      C5 (by R. Quinlan), which is based on entropy minimization.
     The goal was not to achieve the best classification
      accuracy but to find relevant attributes that
      contribute to assigning a text to a given class.
     The significant words appear in the nodes of the
      decision tree.
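
To make "entropy minimization" concrete, the sketch below computes the information gain of splitting a set of labeled reviews on the presence of a single word. It shows only the basic entropy criterion; C5 itself adds refinements that are not reproduced here. The function names and the four toy reviews are hypothetical.

```python
import math


def entropy(pos, neg):
    """Binary entropy (in bits) of a set with pos positive and neg negative items."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def class_counts(subset):
    pos = sum(1 for _, label in subset if label == "P")
    return pos, len(subset) - pos


def information_gain(reviews, word):
    """reviews: non-empty list of (set_of_words, label), label 'P' or 'N'."""
    with_word = [r for r in reviews if word in r[0]]
    without = [r for r in reviews if word not in r[0]]
    n = len(reviews)
    remainder = sum(
        len(part) / n * entropy(*class_counts(part))
        for part in (with_word, without)
    )
    return entropy(*class_counts(reviews)) - remainder


toy = [
    ({"GOOD", "LOCATION"}, "P"),
    ({"GOOD"}, "P"),
    ({"POOR"}, "N"),
    ({"NOISY", "LOCATION"}, "N"),
]
for w in ("GOOD", "LOCATION"):
    print(w, information_gain(toy, w))
```

In this toy set, GOOD separates the classes perfectly, while LOCATION appears equally in both and gives no gain, so a tree builder would test GOOD first.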
+
    An example of a decision tree

    LOCATION > 0:
    :...POOR > 0:
    :   :...GOOD > 0: _P (13)
    :   :   GOOD <= 0:
    :   :   :...EXCELLENT > 0: _P (3)
    :   :       EXCELLENT <= 0:
    :   :       :...GREAT > 0: _P (3)
    :   :           GREAT <= 0:
    :   :           :...CLEAN <= 0: _N (48/4)
    :   :               CLEAN > 0: _P (4/1)
    :   POOR <= 0:
    :   :...DIFFICULT > 0:
    :       :...GOOD > 0: _P (6)
    :       :   GOOD <= 0:
    :       :   :...HELPFUL <= 0: _N (34/7)
    :       :       HELPFUL > 0: _P (5)
    ...
    ...
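
The listing above can be read mechanically. The sketch below pulls the tested words and the leaf statistics out of the same excerpt, assuming the usual C5 output convention in which a leaf `(n/e)` covers n training cases, e of them misclassified. The helper names are hypothetical.

```python
import re

TREE_TEXT = """\
LOCATION > 0:
:...POOR > 0:
:   :...GOOD > 0: _P (13)
:   :   GOOD <= 0:
:   :   :...EXCELLENT > 0: _P (3)
:   :       EXCELLENT <= 0:
:   :       :...GREAT > 0: _P (3)
:   :           GREAT <= 0:
:   :           :...CLEAN <= 0: _N (48/4)
:   :               CLEAN > 0: _P (4/1)
:   POOR <= 0:
:   :...DIFFICULT > 0:
:       :...GOOD > 0: _P (6)
:       :   GOOD <= 0:
:       :   :...HELPFUL <= 0: _N (34/7)
:       :       HELPFUL > 0: _P (5)
"""

TEST_RE = re.compile(r"([^\W\d_]+) (?:>|<=) 0")          # e.g. "GOOD > 0"
LEAF_RE = re.compile(r"_([PN]) \((\d+)(?:/(\d+))?\)")    # e.g. "_N (48/4)"


def words_in_tree(text):
    """Words tested in the tree, in order of first appearance."""
    seen = []
    for line in text.splitlines():
        match = TEST_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def leaves(text):
    """List of (class, cases, errors) for every leaf in the listing."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3) or 0))
        for m in LEAF_RE.finditer(text)
    ]


print(words_in_tree(TREE_TEXT))
print(sum(cases for _, cases, _ in leaves(TREE_TEXT)), "cases in this excerpt")
```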
+
    Finding the significant words

     The classification accuracy, which is proportional to
      the relevance of the words, was between 83% and 93%.
     The decision tree mostly tested whether the frequency
      was > 0 or = 0 (i.e., a binary representation).
     The decision tree provides a list of about 200–300
      words significant for classification from the
      sentiment perspective, together with their
      significance (i.e., how frequently the words are used
      during classification).
     Only 15 words for each language are presented,
      together with their significance (column %).
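
One way to read "frequency of using the words during classification" is the share of reviews whose path through the tree tests a given word. The sketch below computes that percentage under this assumption; the slides do not give the exact formula. The small tree and the four reviews are made up for illustration and are not taken from the results.

```python
from collections import Counter

# Internal node: (word, subtree_if_present, subtree_if_absent); leaf: "P" or "N".
# Hypothetical tree using binary tests (word frequency > 0 or = 0).
TREE = (
    "LOCATION",
    ("POOR", ("GOOD", "P", "N"), "P"),
    ("CLEAN", "P", "N"),
)


def classify(tree, words, used):
    """Walk the tree for one review; record every word that gets tested."""
    while not isinstance(tree, str):
        word, present, absent = tree
        used.add(word)
        tree = present if word in words else absent
    return tree


def word_usage(tree, reviews):
    """Percentage of reviews for which each word is tested."""
    counts = Counter()
    for words in reviews:
        used = set()
        classify(tree, words, used)
        counts.update(used)
    return {w: 100.0 * c / len(reviews) for w, c in counts.items()}


reviews = [{"LOCATION", "GOOD"}, {"LOCATION", "POOR"}, {"CLEAN"}, {"NOISY"}]
usage = word_usage(TREE, reviews)
for word, pct in sorted(usage.items(), key=lambda item: -item[1]):
    print(f"{word:10s} {pct:5.1f}%")
```

Words near the root are tested for almost every review and therefore receive the highest percentage, which matches the idea that they are the most significant.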
+
    Handling large collections

     For languages with a large number of reviews, the
      datasets were randomly split into subsets of
      50,000 reviews because of memory
      requirements, and a decision tree was created for
      each subset.

     Each of the 50,000-review subsets gave almost the
      same list of words.

     The relevancies of the extracted words were averaged.
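
The split-and-average step can be sketched as below. The function names are hypothetical, the seed is only there to make the shuffle repeatable, and a word missing from one subset's tree is assumed to count as 0 in the average; the slides do not say how such words were handled.

```python
import random
from collections import defaultdict


def split_into_subsets(items, subset_size=50_000, seed=0):
    """Shuffle the items and cut them into chunks of subset_size.

    The last chunk may be smaller than subset_size.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [
        shuffled[i:i + subset_size]
        for i in range(0, len(shuffled), subset_size)
    ]


def average_relevancies(per_subset):
    """per_subset: one {word: relevancy} dict per decision tree.

    Assumption: a word absent from a subset's tree contributes 0.
    """
    totals = defaultdict(float)
    for scores in per_subset:
        for word, value in scores.items():
            totals[word] += value
    return {word: total / len(per_subset) for word, total in totals.items()}


subsets = split_into_subsets(range(10), subset_size=4, seed=1)
print([len(s) for s in subsets])
print(average_relevancies([{"GOOD": 40.0, "POOR": 20.0}, {"GOOD": 60.0}]))
```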
+
    Results
+
    Conclusions

     A procedure that applies machine
      learning and natural language processing to
      automatically find significant words was presented.
     From the total number of words (80,000–200,000), only
      about 200–300 were identified as significant.
     The simple, unified procedure worked well for many
      languages.
     Follow-up research focuses on determining the
      strength of sentiment and on generating typical short
      phrases rather than only individual words.
     The procedure might be used in marketing
      research or marketing intelligence, for filtering
      reviews, generating keyword lists, etc.
Thank you for your attention
Vielen Dank für Ihre Aufmerksamkeit
    Gracias por vuestra atención
      Merci de votre attention
   Grazie per la vostra attenzione
      Спасибо за ваше внимание
   ご静聴ありがとうございました
      Děkuji za vaši pozornost
