Text Mining
Barbara Barbosa @bahbbc
BankFacil
26th February 2016
Barbara Barbosa @bahbbc BankFacil
Text Mining
What is it?
The process of deriving information from text. It usually
requires preprocessing of the input data.
Learning problem
Figure: Flow chart of learning problem
Corpus
A corpus is a set of n documents. Each document is
defined as a set of m terms (stems, words, or sets of words).
Here the corpus is all the text posted by clients on BankFacil's
Facebook page (https://www.facebook.com/bankfacil).
The R code is available at http://bit.ly/1XQ0mWw
Tokenizing - Lexical Analysis
Convert to lower case
Remove punctuation
Remove numbers
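The three steps above can be sketched in a few lines. This is a minimal Python illustration, not the author's R code; the function name `normalize` and the sample sentence are made up for the example, and punctuation is replaced by spaces (an assumption) so that adjacent words do not fuse.

```python
import re
import string

def normalize(text):
    """Lower-case the text, drop punctuation and numbers, split into tokens."""
    text = text.lower()
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Remove runs of digits.
    text = re.sub(r"\d+", " ", text)
    # Splitting on whitespace also discards the extra spaces left behind.
    return text.split()

tokens = normalize("Olá! Pedi um empréstimo de 5000 reais, em 2016.")
print(tokens)
```

Note that `string.punctuation` only covers ASCII punctuation; typographic quotes or other Unicode marks would need to be handled separately.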
StopWords
Stopwords [1] are words that do not help characterize
the content of a text.
Removing them can reduce the size of texts by 30% to 50%.
[1] Portuguese stopwords available at:
http://snowball.tartarus.org/algorithms/portuguese/stop.txt
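As a rough sketch of stopword filtering in Python: the `STOPWORDS` set below is a tiny hand-picked sample of Portuguese function words used only for illustration (a real run would load the full list from the URL above), and `remove_stopwords` is a hypothetical helper name.

```python
# Tiny illustrative sample; NOT the full stopword list.
STOPWORDS = {"de", "um", "em", "o", "a", "que", "e", "para"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

tokens = ["pedi", "um", "empréstimo", "de", "reais", "em", "fevereiro"]
kept = remove_stopwords(tokens)

# Fraction of tokens removed, comparable to the reduction quoted above.
reduction = 1 - len(kept) / len(tokens)
print(kept, reduction)
```

Using a set makes each membership check constant time, which matters once the list holds a few hundred words.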
Stemming
Figure:
There are experiments showing a 5% reduction from the
original document size.
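To give a feel for what stemming does, here is a deliberately naive suffix-stripping sketch in Python. The suffix list and the name `toy_stem` are invented for this example; it is not the Snowball Portuguese algorithm and will mis-stem many words.

```python
# Toy suffix list, longest first; chosen only for this illustration.
SUFFIXES = ("amentos", "amento", "ções", "ção", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

# Inflected forms collapse to one term, which shrinks the vocabulary.
for word in ["pagamento", "pagamentos", "empréstimo", "empréstimos"]:
    print(word, "->", toy_stem(word))
```

A production pipeline would use an established stemmer for the target language rather than a hand-written rule list like this one.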
Vector Space Model
Binary
Frequency
tf-idf
tf-idf normalized
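The first two weighting schemes can be sketched as follows. This is a minimal Python example on a made-up two-document corpus; `vocabulary`, `binary_vector`, and `frequency_vector` are illustrative names, and each document becomes one vector with one position per vocabulary term.

```python
from collections import Counter

# Hypothetical corpus: each document is already tokenized.
docs = [
    ["emprestimo", "rapido", "emprestimo"],
    ["atendimento", "rapido"],
]

# One vector position per distinct term, in a fixed (sorted) order.
vocabulary = sorted({term for doc in docs for term in doc})

def binary_vector(doc):
    """1 if the term occurs in the document, 0 otherwise."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocabulary]

def frequency_vector(doc):
    """Number of times each vocabulary term occurs in the document."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]

print(vocabulary)
for doc in docs:
    print(binary_vector(doc), frequency_vector(doc))
```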
TF-IDF
TF-IDF (Term Frequency - Inverse Document Frequency)
$$\mathrm{tfidf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{Tr(t_k)} \qquad (1)$$
|Tr| - the total number of documents in the corpus Tr
#(t_k, d_j) - the number of times t_k occurs in d_j
Tr(t_k) - the number of documents in Tr in which t_k appears
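A small Python sketch of equation (1), assuming the natural logarithm (the slide does not state the base) and a made-up two-document corpus. The helper `l2_normalize` illustrates one common reading of "normalized tf-idf", namely scaling each document vector to unit length; that choice is an assumption, not something the slides specify.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, following equation (1)."""
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # raw term counts within this document
        weights.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weights

def l2_normalize(vector):
    """Scale a {term: weight} dict to unit Euclidean length (assumed scheme)."""
    norm = math.sqrt(sum(v * v for v in vector.values()))
    return {t: (v / norm if norm else 0.0) for t, v in vector.items()}

docs = [
    ["emprestimo", "rapido", "emprestimo"],
    ["atendimento", "rapido"],
]
weights = tfidf(docs)
print(weights)
print([l2_normalize(w) for w in weights])
```

A term that appears in every document, such as "rapido" here, gets weight 0 because log(1) = 0, which is the intended effect of the inverse document frequency factor.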
Luhn’s experiment
Figure:
Zipf’s law
Zipf’s law states that given some corpus, the frequency of any
word is inversely proportional to its rank in the frequency table.
More about Zipf’s law:
https://www.youtube.com/watch?v=fCn8zs912OE
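One way to check this on a corpus is to build the rank-frequency table and look at the product rank × frequency, which stays roughly constant if frequency is inversely proportional to rank. The Python sketch below uses a synthetic token list constructed to follow the law exactly; `rank_frequency` is an illustrative name, and real text only follows the law approximately.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, term, frequency) tuples, most frequent term first."""
    ranked = Counter(tokens).most_common()
    return [(rank, term, freq) for rank, (term, freq) in enumerate(ranked, start=1)]

# Synthetic tokens: "a" occurs 6 times, "b" 3 times, "c" twice.
tokens = list("aaaaaabbbcc")
table = rank_frequency(tokens)

for rank, term, freq in table:
    # Under Zipf's law, rank * freq is approximately the same for every row.
    print(rank, term, freq, rank * freq)
```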
Bibliography
Based on slides by Prof. Sarajane Marques Peres for the Data Mining
course.
