This document discusses text mining of Indian languages. The work is divided into 3 phases: 1) document representation using the vector space model, 2) document categorization based on language using classifiers such as kNN, J48 and Naive Bayes, and 3) language-independent keyword extraction using TF-IDF. For keyword extraction, the optimal TF-IDF thresholds that achieve 100% recall are determined for each language: 3.0910 for Kannada, 2.3979 for Tamil and 3.4095 for Telugu texts. For domain categorization, classification accuracies of 87.5% for kNN and 70.83% for J48 are reported. Naive Bayes is found to be the most efficient classifier for Indian language text categorization.
1. India is a multilingual nation, and text mining is a growing
research area within data mining. The aim is therefore to conduct a
detailed study of text mining for Indian languages. The
objectives of the research work are as follows:
1. To design a method for representing Indian language
documents.
2. To propose an algorithm to categorize documents based
on language and domain.
3. To design a language-independent algorithm to extract
keywords from all Indian language documents.
Third DC Meeting
5. The research work is divided into 3 phases:
Phase 1: Document representation (vector space model, properties of the corpus)
Phase 2: Document categorization (based on the language; language-independent classifier)
Phase 3: Keyword extraction (using TF, IDF, TFIDF)
6. In a text mining study, a document is used as the basic unit of analysis.
To analyze a document, the first step is data preprocessing.
Data preprocessing means converting unstructured data into structured data.
Preprocessing steps: Corpus -> Standardization -> Tokenization -> Remove Stop Words -> Stemming -> Vector Space Model
After preprocessing, data mining algorithms can be applied.
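The preprocessing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the whitespace tokenizer, the identity stemmer and the function names are all hypothetical, and the sample sentences are English placeholders rather than data from this work.

```python
import string
import unicodedata

# Hypothetical stop-word list; a real list would be compiled per language.
STOP_WORDS = {"the", "is", "a", "of", "and"}


def standardize(text):
    """Normalize the Unicode form and lowercase the text."""
    return unicodedata.normalize("NFC", text).lower()


def tokenize(text):
    """Split on whitespace and strip ASCII punctuation from token edges."""
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in tokens if tok]


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def stem(token):
    """Placeholder (assumption): a real stemmer is language specific."""
    return token


def preprocess(text):
    """Standardize, tokenize, remove stop words, then stem."""
    tokens = remove_stop_words(tokenize(standardize(text)))
    return [stem(tok) for tok in tokens]


def vector_space_model(corpus):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    processed = [preprocess(doc) for doc in corpus]
    vocabulary = sorted({tok for doc in processed for tok in doc})
    matrix = [[doc.count(term) for term in vocabulary] for doc in processed]
    return vocabulary, matrix


corpus = ["The film is a hit.", "The team won the match, and the film lost."]
vocabulary, matrix = vector_space_model(corpus)
print(vocabulary)
print(matrix)
```

Each row of the matrix is one document and each column one vocabulary term, which is the structured form that the later classification and keyword extraction steps operate on.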
7. [Preprocessing pipeline: Corpus -> Standardization -> Tokenization -> Remove Stop Words -> Stemming -> Vector Space Model]
1. What is text mining?
* Strict definition:
The nontrivial extraction of implicit, previously unknown, and
potentially useful information from [textual] data.
* Loose definition:
The science of extracting useful information from large [textual]
data sets.
* Text mining = information retrieval + statistics + artificial
intelligence (natural language processing, machine learning / pattern
recognition)
2. What are the data sources for text mining?
* World Wide Web
3. What is the need for Indian language text mining?
* The Constitution of India allows each Indian state to choose
its own official language for official communication at the state
level. The amount of textual data available in electronic form in
the various Indian regional languages is growing at an accelerating
rate. Indian language text mining is therefore required.
12. Stemming is the process of reducing derived words
to their root form.
Example:
"cats", "catlike", "catty" -> "cat"
"stemmer", "stemming", "stemmed" -> "stem"
"fishing", "fished", "fisher" -> "fish"
"argue", "argued", "argues", "arguing", "argus" -> "argu"
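To make the idea of stemming concrete, here is a toy suffix-stripping sketch in Python. It is not the stemmer used in this work: the suffix list, the consonant-undoubling rule and the name `toy_stem` are assumptions chosen only to reproduce some of the English examples above, and a stemmer for Kannada, Tamil or Telugu would need language-specific rules.

```python
# Hypothetical, English-only suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "er", "s")
# Trailing doubled letters that are left alone (vowels, and "ll", "ss", "zz").
NO_UNDOUBLE = set("aeioulsz")


def toy_stem(word):
    """Strip one known suffix, then undouble a trailing consonant pair."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        # Only strip when a stem of at least three letters remains.
        if word.endswith(suffix) and len(stem) >= 3:
            word = stem
            break
    # "stemm" -> "stem", but "fall" and "miss" keep their doubled letter.
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in NO_UNDOUBLE:
        word = word[:-1]
    return word


for w in ["cats", "stemmer", "stemming", "stemmed", "fishing", "fished", "fisher"]:
    print(w, "->", toy_stem(w))
```

Words such as "catlike" or "argues" are not reduced to the stems listed above by this sketch; handling them needs a fuller rule set such as the Porter algorithm.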
14.
Language     Kannada   Tamil   Telugu
Docs         100       100     100
Tokens       26315     20360   18427
Vocabulary   20417     15941   14652
Zipf's Law describes the word behavior in an entire corpus.
15. In natural language, there are a few frequent terms and many rare terms.
16. Frequency * rank = constant, so the frequency of a word is
inversely proportional to its rank.
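The rank-frequency relation can be checked on any token list with a short Python sketch. The function name `rank_frequency` and the toy token stream are hypothetical; the stream was built so that the product comes out exactly constant, which real corpora only approximate.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, term, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, term, freq, rank * freq)
            for rank, (term, freq) in enumerate(ordered, start=1)]


# Hypothetical toy token stream, constructed to follow the law exactly.
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
for row in rank_frequency(tokens):
    print(row)
```

On a real corpus the last column would fluctuate around a constant rather than equal it.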
17. Objective: To classify the documents based on language.
Documents -> Classifier -> Tamil language documents, Kannada language documents, Telugu language documents
18. Algorithm
1. Identify the language-specific files.
2. Associate a language label with each of the files.
3. Build a corpus C.
4. Preprocess the corpus C.
5. Apply a stemming algorithm to reduce all the words to their root
form.
6. Generate a VSM (term-document matrix) D using binary term
occurrence, where D(i, j) is 1 if the jth term occurs in document i
and 0 otherwise.
7. Train the classifiers (kNN, J48 and NB) using C as training examples.
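Steps 6 and 7 can be illustrated with a small pure-Python sketch that builds a binary term-occurrence matrix and classifies a new document with kNN. Several things here are assumptions rather than the setup used in this work: the Hamming distance, the helper names `binary_vector` and `knn_predict`, and the placeholder tokens standing in for preprocessed words.

```python
from collections import Counter


def binary_vector(tokens, vocabulary):
    """Binary term occurrence: 1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def knn_predict(train_vectors, train_labels, query, k=1):
    """Label a query vector by majority vote among its k nearest neighbours."""
    def hamming(u, v):
        # Number of vocabulary positions where the two vectors differ.
        return sum(a != b for a, b in zip(u, v))

    nearest = sorted(range(len(train_vectors)),
                     key=lambda i: hamming(train_vectors[i], query))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical, already preprocessed documents; tokens are placeholders.
corpus = [
    (["kn_w1", "kn_w2", "kn_w3"], "Kannada"),
    (["kn_w2", "kn_w4"], "Kannada"),
    (["ta_w1", "ta_w2"], "Tamil"),
    (["ta_w2", "ta_w3", "ta_w4"], "Tamil"),
    (["te_w1", "te_w2"], "Telugu"),
    (["te_w2", "te_w3"], "Telugu"),
]
vocabulary = sorted({tok for tokens, _ in corpus for tok in tokens})
D = [binary_vector(tokens, vocabulary) for tokens, _ in corpus]
labels = [label for _, label in corpus]

# Terms outside the vocabulary (here "unseen") are simply ignored.
query = binary_vector(["ta_w2", "ta_w4", "unseen"], vocabulary)
print(knn_predict(D, labels, query, k=1))
```

Language identification is easy for this representation because documents in different scripts share almost no vocabulary, so their binary vectors barely overlap.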
19. Confusion Matrix
kNN Classifier
Kannada   Tamil   Telugu
87        2       11
2         96      2
4         0       96
J48 Classifier
Kannada   Tamil   Telugu
99        1       0
1         97      2
4         0       96
NB Classifier
Kannada   Tamil   Telugu
100       0       0
2         98      0
5         0       95
20. Data mining algorithms are used for English text categorization;
they can similarly be applied to Indian language text categorization.
Effectiveness of the classification algorithms:
kNN gives 93% accuracy.
Decision tree (J48) gives 97.33% accuracy.
Naïve Bayes gives 97.66% accuracy.
Naïve Bayes is an efficient algorithm for Indian language text
categorization.
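These accuracies follow from the confusion matrices as the sum of the diagonal divided by the total number of documents. A minimal Python sketch, with the helper name `accuracy` being an assumption and the counts copied from the confusion matrices on slide 19:

```python
def accuracy(confusion):
    """Percentage of documents on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total


# Counts copied from the confusion matrices on slide 19.
matrices = {
    "kNN": [[87, 2, 11], [2, 96, 2], [4, 0, 96]],
    "J48": [[99, 1, 0], [1, 97, 2], [4, 0, 96]],
    "NB": [[100, 0, 0], [2, 98, 0], [5, 0, 95]],
}
for name, matrix in matrices.items():
    print(f"{name}: {accuracy(matrix):.2f}%")
```

For Naïve Bayes the ratio is 293/300 = 97.666...%, which the slide reports truncated as 97.66%.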
21. The objective of this work is to design a language-independent
classifier to categorize the documents based on domain.
Documents -> Language Independent Classifier -> Cinema, Sports, Politics
22.
S N      Domain     No. of documents   No. of tokens   No. of vocabulary   No. of vocabulary after removing stop words
Case 1   Cinema     5                  1378            978                 943
         Politics   5                  831             560                 537
         Sports     5                  712             426                 398
Case 2   Cinema     48                 13463           5302                5107
         Sports     48                 24096           8156                7934
23. The predictions of the classification models for case 1 are tabulated in the form of a
confusion matrix.
kNN accuracy = (0+0+4) / (0+0+5+0+0+5+1+0+4) = 26.66%
J48 accuracy = (2+3+4) / (2+0+3+0+3+2+0+1+4) = 60.00%
24. The predictions of the classification models for case 2 are tabulated in the form of a confusion matrix.
kNN classifier accuracy = (38+46) / (38+10+2+46) = 87.5%
J48 classifier accuracy = (48+20) / (48+0+28+20) = 70.83%
25. In case 1, only five documents from each of three domains (Cinema, Sports, and Politics) are used.
A confusion matrix is used to measure the accuracy of the classification algorithms.
kNN accuracy = (0+0+4) / (0+0+5+0+0+5+1+0+4) = 26.66%
J48 accuracy = (2+3+4) / (2+0+3+0+3+2+0+1+4) = 60.00%
In case 2, 48 documents from each of two domains (Cinema, Sports) are used.
A confusion matrix is used to measure the accuracy of the classification algorithms.
kNN classifier accuracy = (38+46) / (38+10+2+46) = 87.5%
J48 classifier accuracy = (48+20) / (48+0+28+20) = 70.83%
26. Objective:
Keyword extraction is the task of identifying a small set of
words (keywords) from a document that describe the meaning of
the document. It should be done systematically, without human
intervention, and the model should be language independent.
[Flow diagram labels: Indian Language Document, Data Preprocessing, Candidate Keywords, TF, IDF, Ranking, Selection of Keywords with line/page no.]
27. Algorithm
1. Tokenize the Dravidian language text document.
2. Eliminate stop words and frequent words to obtain the vocabulary words.
3. Store the vocabulary words in a matrix called the vector space
model.
4. Calculate the term frequency (TF), inverse document frequency (IDF) and
TF*IDF for each word.
5. Select as keywords the vocabulary words whose TF*IDF exceeds a fixed threshold value.
6. Along with the keywords, extract the corresponding line number, paragraph
number or page number.
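A minimal Python sketch of steps 4 to 6 follows. The exact TF and IDF definitions are not given in the slides, so this sketch assumes raw term counts for TF and ln(N / df) for IDF; the function names, the sample tokens and the threshold of 1.5 are likewise hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Return one {term: tf * idf} dict per document (raw tf, idf = ln(N / df))."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return scores


def extract_keywords(lines, scores, threshold):
    """Map each term scoring above the threshold to the 1-based lines it occurs on."""
    keywords = {}
    for line_no, line in enumerate(lines, start=1):
        for term in line:
            if scores.get(term, 0.0) > threshold:
                hits = keywords.setdefault(term, [])
                if line_no not in hits:
                    hits.append(line_no)
    return keywords


# Hypothetical preprocessed corpus; document A is kept line by line.
doc_a_lines = [["raga", "tala", "common"], ["raga", "common"]]
doc_a = [term for line in doc_a_lines for term in line]
documents = [doc_a, ["cricket", "common"], ["film", "common"]]

scores = tfidf_scores(documents)
print(extract_keywords(doc_a_lines, scores[0], threshold=1.5))
```

A term that occurs in every document gets an IDF of zero, so it can never pass the threshold, which is why the scheme needs no language-specific knowledge beyond tokenization.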
28. Recall = (No. of relevant documents retrieved) / (Total no. of relevant documents in the corpus)
Precision = (No. of relevant documents retrieved) / (Total no. of documents retrieved from the corpus)
Recall = TP / (TP+FN) × 100
Precision = TP / (TP+FP) × 100
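The two measures can be computed directly from the set of retrieved items and the set of relevant items. In this Python sketch the function name `precision_recall` and the sample item names are hypothetical:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) in percent for two collections of items."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant items that were retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = 100.0 * tp / (tp + fp) if retrieved else 0.0
    recall = 100.0 * tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical example: three relevant items, four retrieved.
retrieved = {"k1", "k2", "k3", "x1"}
relevant = {"k1", "k2", "k3"}
print(precision_recall(retrieved, relevant))
```

Here every relevant item is retrieved, so recall is 100%, while the one extra retrieved item lowers precision to 75%.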
29. In the case of Tamil text, the recall is 100% when
TFIDF is 2.3979. We therefore use
2.3979 as the TFIDF threshold
value to extract the keywords: words
whose TFIDF value is greater than
2.3979 are considered keywords.
30. In the case of Kannada text, the recall is 100% when
TFIDF is 3.0910. We therefore use
3.0910 as the TFIDF threshold value
to extract the keywords: words
whose TFIDF value is greater than
3.0910 are considered keywords.
31. In the case of Telugu text (Fig. 6),
the recall is 100% when TFIDF is 3.4095.
We therefore use 3.4095 as the TFIDF
threshold value to extract the
keywords: words whose TFIDF value
is greater than 3.4095 are
considered keywords.
33. The third phase of the work is keyword extraction. TF*IDF is used to evaluate
how important a word is in a document, and a TF*IDF threshold is used to
select the important keywords. In the case of Kannada text, the recall is 100%
when TFIDF is 3.0910, so 3.0910 is used as the TFIDF threshold
value to extract the keywords. In the case of Tamil text, the recall is 100%
when TFIDF is 2.3979, so 2.3979 is used as the TFIDF threshold value to
extract the keywords. In the case of Telugu text, the recall is 100%
when TFIDF is 3.4095, so 3.4095 is used as the TFIDF threshold value to extract
the keywords.
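One way to arrive at such a threshold is to sweep candidate values from high to low and keep the largest one at which every manually chosen keyword is still selected. The Python sketch below illustrates that idea only; the scores, the reference keyword set and the function names are hypothetical, and it assumes an inclusive comparison (score >= threshold), whereas the slides describe selecting words whose TFIDF is greater than the threshold.

```python
def recall_at(threshold, scores, gold_keywords):
    """Recall (%) of the terms whose score is at least the threshold."""
    selected = {term for term, score in scores.items() if score >= threshold}
    return 100.0 * len(selected & gold_keywords) / len(gold_keywords)


def highest_full_recall_threshold(scores, gold_keywords):
    """Largest candidate threshold that still retrieves every gold keyword."""
    candidates = sorted(set(scores.values()), reverse=True)
    for threshold in candidates:
        if recall_at(threshold, scores, gold_keywords) == 100.0:
            return threshold
    return None  # some gold keyword never appears among the scored terms


# Hypothetical TF-IDF scores and manually chosen keywords, for illustration only.
scores = {"w1": 4.2, "w2": 3.1, "w3": 2.5, "w4": 1.0, "w5": 0.4}
gold = {"w1", "w2", "w3"}
print(highest_full_recall_threshold(scores, gold))
```

Choosing the largest threshold with full recall keeps precision as high as possible without missing any reference keyword; the sketch assumes the reference set is non-empty.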