Text mining Pre-processing
1. Text Mining
Barbara Barbosa @bahbbc
BankFacil
26th February 2016
Barbara Barbosa @bahbbc BankFacil
Text Mining
2. What is it?
Text mining is the process of deriving information from text. It usually
requires preprocessing the input data.
4. Corpus
A corpus is a set of n documents. Each document is
defined as a set of m terms (stems, words, or groups of words).
Here the corpus is all the text posted by clients on BankFacil's
Facebook page (https://www.facebook.com/bankfacil).
The code, written in R, is available at http://bit.ly/1XQ0mWw
5. Tokenizing - Lexical Analysis
Convert to lower case
Remove punctuation
Remove numbers
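The three steps above can be sketched in a few lines. This is a minimal illustration in Python, not the R code linked from these slides; the function name `tokenize` and the sample sentence are assumptions made for the example, and only ASCII punctuation is handled.

```python
import re
import string

def tokenize(text):
    """Lower-case the text, strip punctuation and numbers, split on whitespace."""
    # Step 1: convert to lower case.
    text = text.lower()
    # Step 2: replace ASCII punctuation with spaces, so "ok,thanks" yields two tokens.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Step 3: remove numbers.
    text = re.sub(r"\d+", " ", text)
    return text.split()

tokens = tokenize("Olá! Preciso de 2 empréstimos, OK?")
print(tokens)  # ['olá', 'preciso', 'de', 'empréstimos', 'ok']
```

Replacing punctuation with a space rather than deleting it avoids gluing neighbouring words together.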
6. StopWords
Stopwords are words that contribute little to characterizing the
content of a text.
Removing them can reduce the size of texts by 30% to 50%.
Portuguese stopwords are available at:
http://snowball.tartarus.org/algorithms/portuguese/stop.txt
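Filtering stopwords out of a token list might look like the sketch below. It is written in Python for illustration; the `STOPWORDS` set is a small hand-picked subset of Portuguese stopwords (an assumption for this example, the full list is at the URL above), and `remove_stopwords` is a hypothetical helper name.

```python
# Illustrative subset only; a real run would load the full stopword list.
STOPWORDS = {"de", "a", "o", "que", "e", "do", "da", "em", "um", "para"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]

tokens = ["preciso", "de", "um", "empréstimo", "para", "a", "casa"]
filtered = remove_stopwords(tokens)
print(filtered)  # ['preciso', 'empréstimo', 'casa']
```

In this toy sentence four of the seven tokens are dropped, which shows how stopword removal shrinks a text.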
9. TF-IDF
TF-IDF (Term Frequency - Inverse Document Frequency)
$$\mathrm{tfidf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|Tr|}{Tr(t_k)} \tag{1}$$
$|Tr|$ - the total number of documents in the corpus $Tr$
$\#(t_k, d_j)$ - the number of times the term $t_k$ occurs in document $d_j$
$Tr(t_k)$ - the number of documents in $Tr$ in which $t_k$ appears
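Equation (1) can be computed directly from term counts. The sketch below is an illustration in Python under a few assumptions: the corpus is a list of already tokenized documents, the logarithm is the natural one (the slide does not state a base), and the function name `tfidf` and the toy corpus are made up for the example.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document, following equation (1)."""
    n_docs = len(corpus)  # total number of documents in the corpus
    # Number of documents in which each term appears (counted once per document).
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)  # number of times each term occurs in this document
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

corpus = [["emprestimo", "casa", "emprestimo"], ["emprestimo", "carro"]]
weights = tfidf(corpus)
print(weights[0]["emprestimo"])  # 0.0, the term appears in every document
print(weights[0]["casa"])        # log(2), the term appears in one of two documents
```

A term that occurs in every document gets weight zero regardless of how often it occurs, which is the point of the inverse document frequency factor.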
11. Zipf’s law
Zipf’s law states that given some corpus, the frequency of any
word is inversely proportional to its rank in the frequency table.
More about Zipf’s law
https://www.youtube.com/watch?v=fCn8zs912OE
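One way to check Zipf's law on a corpus is to build the rank-frequency table and look at the product rank × frequency, which the law predicts to be roughly constant. The sketch below is a Python illustration; `rank_frequency` is a hypothetical helper, and the toy token list is constructed to follow the law exactly, whereas real corpora only approximate it.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, term, frequency) rows, most frequent term first."""
    counts = Counter(tokens)
    return [
        (rank, term, freq)
        for rank, (term, freq) in enumerate(counts.most_common(), start=1)
    ]

# Toy data: frequencies 6, 3, 2 are proportional to 1/1, 1/2, 1/3.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for rank, term, freq in rank_frequency(tokens):
    print(rank, term, freq, rank * freq)  # the last column is 6 on every row
```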
12. Bibliography
Based on slides by Prof. Sarajane Marques Peres for the Data Mining
course.