Third DC Meeting1.ppt

India is a multilingual nation, and text mining is a growing
research area in data mining. The aim is therefore to conduct a
detailed study of text mining for Indian languages. The
objectives of the research work are as follows:
1. To design a method for representing Indian language documents.
2. To propose an algorithm to categorize documents based
on language and domain.
3. To design a language-independent algorithm to extract
keywords from all Indian language documents.
1
Third DC Meeting

Kannada
ನಮಸ್ಕಾರ, ಶುಭ ಮುಂಜಾನೆ
ಶುಭ ಮಧ್ಯಾಹ್ನ, ಶುಭ ರಾತ್ರಿ
ಶುಭ ಹಾರೈಕೆ (ಗುಡ್ ಬೈ)
ಧನ್ಯವಾದಗಳು
namaskAra, shubha muMjAne
shubha madhyAhna, shubha rAtri
shubha hAraike (guD bai)
dhanyavAdagaLu
Tamil
வணக்கம் (காலை)
வணக்கம் (மதிய)
வணக்கம்
நல்லிரவாக அமையட்டும்
சென்று வருகிறேன்
நன்றி
vaNakkam (kAlai)
vaNakkam (matiya)
vaNakkam
~nalliravAka amaiyaTTum
cenRu varukiREn
~nanRi
Telugu
హలో, నమస్కారం
నమస్కారం, నమస్తే
నమస్తే, కృతజ్ఞతలు
halO, namaskAraM
namaskAraM, namastE
namastE, kRutaj~jatalu

The research work is divided into 3 phases:
1. Document representation: vector space model, properties of the corpus
2. Document categorization: based on the language, and with a language-independent classifier
3. Keyword extraction: using TF, IDF and TFIDF

In a text mining study, a document
is used as the basic unit of analysis.
To analyze a document, the first
step is data preprocessing.
Data preprocessing means
converting unstructured data into
structured data:
Corpus -> Standardization -> Tokenization -> Remove Stop Words -> Stemming -> Vector Space Model
After preprocessing, data mining
algorithms can be applied.
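A minimal sketch of the first preprocessing steps, under stated assumptions: "standardization" is taken here to mean Unicode NFC normalization, tokens are split on whitespace, and the stop-word set is a hypothetical placeholder. None of these choices is confirmed by the slides, and stemming is left out.

```python
import string
import unicodedata

# ASCII punctuation plus the danda marks used in many Indian scripts (assumed set).
PUNCT = string.punctuation + "\u0964\u0965"

def preprocess(text, stop_words):
    """Standardize, tokenize, and remove stop words from one document."""
    # Standardization (assumption): bring the text to one Unicode form.
    text = unicodedata.normalize("NFC", text)
    # Tokenization: split on whitespace, then trim punctuation from each token.
    tokens = [word.strip(PUNCT) for word in text.split()]
    # Stop-word removal: drop empty tokens and words in the stop list.
    return [t for t in tokens if t and t not in stop_words]

# Hypothetical stop list, for illustration only.
tokens = preprocess("shubha rAtri, shubha muMjAne.", {"shubha"})
```

Whitespace splitting is used instead of a `\w+` pattern because vowel signs and viramas in Indian scripts are combining marks, which a word-character pattern may split apart.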

1. What is Text Mining?
* Strict definition:
The nontrivial extraction of implicit, previously unknown, and
potentially useful information from [textual] data.
* Loose definition:
The science of extracting useful information from large [textual]
data sets.
* Text mining = information retrieval + statistics + artificial
intelligence (natural language processing, machine learning / pattern
recognition)
2. What are the Data sources for text mining?
* World Wide Web
3. What is the need for Indian language text mining?
* The Constitution of India allows each Indian state to choose
its own official language for official communication at the state
level. The amount of textual data available in electronic form in
the various Indian regional languages is growing constantly.
Therefore Indian language text mining is required.

Stemming is the process of reducing derived words
to their root form.
Example:
"cats", "catlike", "catty" -> "cat"
"stemmer", "stemming", "stemmed" -> "stem"
"fishing", "fished", "fisher" -> "fish"
"argue", "argued", "argues", "arguing", "argus" -> "argu"

       W1  W2  W3  W4  W5
DOC1   1   0   2   1   0
DOC2   2   0   1   0   3
DOC3   0   1   2   2   0
d_ij represents the number of times that term j appears
in document i.
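A small sketch of how such a term-document matrix could be built from tokenized documents. The function name and the placeholder tokens `w1` to `w5` are illustrative assumptions; the three sample documents are chosen so that the counts match the matrix above.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    vocab = sorted({term for tokens in docs for term in tokens})
    matrix = []
    for tokens in docs:
        counts = Counter(tokens)  # term frequencies for this document
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

# Placeholder documents (hypothetical tokens w1..w5).
docs = [
    ["w1", "w3", "w3", "w4"],
    ["w1", "w1", "w3", "w5", "w5", "w5"],
    ["w2", "w3", "w3", "w4", "w4"],
]
vocab, matrix = term_document_matrix(docs)
```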

Language    Kannada  Tamil  Telugu
Docs        100      100    100
Tokens      26315    20360  18427
Vocabulary  20417    15941  14652
Zipf's law describes how
word frequencies behave across an
entire corpus.

In natural language, there
are a few frequent terms
and many rare terms.

Frequency * rank = constant (approximately).
So the frequency of a word is
inversely proportional to its
rank.
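One way to check this relation on a corpus is to rank words by frequency and look at the product rank * frequency. The sketch below uses a tiny made-up token list, not the corpus described in the slides; on real text the product is only roughly constant.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return rows of (rank, word, frequency, rank * frequency), most frequent first."""
    ranked = Counter(tokens).most_common()
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]

# Toy data (assumption): frequencies 6, 3, 2, 1.
rows = rank_frequency("a a a a a a b b b c c d".split())
for rank, word, freq, product in rows:
    print(rank, word, freq, product)
```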

Objective: To classify the documents based on language.
Documents -> Classifier ->
* Tamil language documents
* Kannada language documents
* Telugu language documents

Algorithm
1. Identify the language-specific files.
2. Associate a language label with each of the files.
3. Build a corpus C.
4. Preprocess the corpus C.
5. Apply a stemming algorithm to reduce all the words to their root
form.
6. Generate a VSM (term-document matrix) using binary term
occurrence: D(i, j) = 1 if the jth term occurs in document i, and 0
otherwise.
7. Train the classifiers (kNN, J48 and NB) using C as training examples.
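The steps above can be sketched in a few lines. This is an illustration only: it uses a 1-nearest-neighbour rule with Hamming distance as a stand-in for the kNN classifier, skips stemming, and trains on a handful of transliterated greetings rather than the 300-document corpus. All function names are hypothetical, and J48 and NB are not shown.

```python
def binary_vector(tokens, vocab):
    """Binary term occurrence: 1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]

def train(labelled_docs):
    """Build the vocabulary and one labelled binary vector per training document."""
    vocab = sorted({t for tokens, _ in labelled_docs for t in tokens})
    vectors = [(binary_vector(tokens, vocab), label) for tokens, label in labelled_docs]
    return vocab, vectors

def predict(tokens, vocab, vectors):
    """Return the label of the training vector closest in Hamming distance."""
    query = binary_vector(tokens, vocab)

    def distance(vec):
        return sum(a != b for a, b in zip(query, vec))

    return min(vectors, key=lambda pair: distance(pair[0]))[1]

# Toy training corpus (assumption): transliterated words with language labels.
corpus = [
    (["namaskAra", "shubha", "muMjAne"], "Kannada"),
    (["shubha", "rAtri", "dhanyavAdagaLu"], "Kannada"),
    (["vaNakkam", "kAlai"], "Tamil"),
    (["cenRu", "varukiREn", "nanRi"], "Tamil"),
    (["halO", "namaskAraM"], "Telugu"),
    (["namastE", "kRutaj~jatalu"], "Telugu"),
]
vocab, vectors = train(corpus)
label = predict(["shubha", "madhyAhna"], vocab, vectors)
```

Words that never occurred in training (here `madhyAhna`) are simply ignored, because the vocabulary is fixed at training time.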

Confusion Matrix
         kNN Classifier         J48 Classifier         NB Classifier
         Kannada Tamil Telugu   Kannada Tamil Telugu   Kannada Tamil Telugu
Kannada  87      2     11       99      1     0        100     0     0
Tamil    2       96    2        1       97    2        2       98    0
Telugu   4       0     96       4       0     96       5       0     95
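Accuracy can be read off a confusion matrix as the sum of the diagonal divided by the sum of all cells. A short sketch, using the three matrices above (the function and variable names are illustrative):

```python
def accuracy(matrix):
    """Percentage of correctly classified documents in a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal cells
    total = sum(sum(row) for row in matrix)                   # all cells
    return 100.0 * correct / total

knn = [[87, 2, 11], [2, 96, 2], [4, 0, 96]]
j48 = [[99, 1, 0], [1, 97, 2], [4, 0, 96]]
nb = [[100, 0, 0], [2, 98, 0], [5, 0, 95]]

for name, matrix in [("kNN", knn), ("J48", j48), ("NB", nb)]:
    print(name, round(accuracy(matrix), 2))
```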

Data mining algorithms used for English text categorization
can also be applied to Indian language text categorization.
The effectiveness of the classification algorithms:
kNN gives 93% accuracy
Decision tree (J48) gives 97.33% accuracy
Naïve Bayes gives 97.67% accuracy
Naïve Bayes is the most accurate of the three algorithms for Indian
language text categorization in this experiment.

The objective of this work is to design a language-independent
classifier to categorize the documents based on domain.
Documents -> Language Independent Classifier -> Cinema, Sports, Politics

S.N.    Domain    No. of documents  No. of tokens  No. of vocabulary  No. of vocabulary after removing stop-words
Case 1  Cinema    5                 1378           978                943
        Politics  5                 831            560                537
        Sports    5                 712            426                398
Case 2  Cinema    48                13463          5302               5107
        Sports    48                24096          8156               7934

The predictions of the classification models for case 1 are tabulated in the form of
a confusion matrix.
kNN accuracy = (0+0+4) / (0+0+5+0+0+5+1+0+4) = 26.67%
J48 accuracy = (2+3+4) / (2+0+3+0+3+2+0+1+4) = 60.00%

The predictions of the classification models for case 2 are tabulated in the form of a confusion matrix.
kNN classifier accuracy = (38+46) / (38+10+2+46) = 87.5%
J48 classifier accuracy = (48+20) / (48+0+28+20) = 70.83%

In case 1, only five documents from each of three domains (Cinema, Sports, and Politics) are used.
A confusion matrix is used to measure the accuracy of each classification algorithm.
kNN accuracy = (0+0+4) / (0+0+5+0+0+5+1+0+4) = 26.67%
J48 accuracy = (2+3+4) / (2+0+3+0+3+2+0+1+4) = 60.00%
In case 2, 48 documents from each of two domains (Cinema, Sports) are used.
A confusion matrix is again used to measure the accuracy of each classification algorithm.
kNN classifier accuracy = (38+46) / (38+10+2+46) = 87.5%
J48 classifier accuracy = (48+20) / (48+0+28+20) = 70.83%

Objective:
* Keyword extraction is the task of identifying a small set of
words (keywords) from a document that can describe
the meaning of the document.
* It should be done systematically, without human
intervention, and the model should be language independent.
Indian Language Document -> Data Preprocessing -> Candidate Keywords -> TF, IDF -> Ranking -> Selection of Keywords with line/page no.

Algorithm
1. Tokenize the Dravidian language text document.
2. Eliminate stop words and frequent words to obtain the vocabulary words.
3. Store the vocabulary words in the form of a matrix called the vector space
model.
4. Calculate the term frequency (TF), inverse document frequency (IDF) and
TF*IDF for each word.
5. Select keywords from the vocabulary words by fixing a threshold value for TF*IDF.
6. Along with each keyword, extract the corresponding line number, paragraph
number or page number.
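Steps 4 and 5 can be sketched as follows. The slides do not state which TF*IDF variant is used, so this example assumes raw term counts for TF and the natural logarithm of N/df for IDF; the function name and the toy documents are hypothetical, and the line-number lookup of step 6 is not shown.

```python
import math
from collections import Counter

def tfidf_keywords(docs, threshold):
    """For each tokenized document, keep the terms whose TF*IDF exceeds the threshold."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    keywords = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term counts (assumed TF definition)
        scores = {t: count * math.log(n_docs / df[t]) for t, count in tf.items()}
        keywords.append({t: s for t, s in scores.items() if s > threshold})
    return keywords

# Toy example: "a" occurs in every document, so its IDF (and score) is zero.
result = tfidf_keywords([["a", "b", "b"], ["a", "c"]], threshold=1.0)
```

A term that appears in every document scores zero under this variant, which is why frequent words drop out without a separate filter.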

Recall = (No. of relevant documents retrieved) / (Total no. of relevant documents in the corpus)
Precision = (No. of relevant documents retrieved) / (Total no. of documents retrieved from the corpus)
Recall = TP / (TP+FN) × 100
Precision = TP / (TP+FP) × 100
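A short sketch of these two measures applied to keyword extraction, where the "retrieved" set is the extracted keywords and the "relevant" set is a reference list. The sample sets are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as percentages for two collections of items."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant items that were retrieved
    fp = len(retrieved - relevant)   # retrieved items that are not relevant
    fn = len(relevant - retrieved)   # relevant items that were missed
    precision = 100.0 * tp / (tp + fp) if retrieved else 0.0
    recall = 100.0 * tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Hypothetical example: four extracted keywords, two of them in the reference list.
precision, recall = precision_recall({"a", "b", "c", "d"}, {"a", "b"})
```

In this example recall is 100% while precision is only 50%, which shows that a threshold chosen for full recall can still admit non-keywords.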

In the case of Tamil text, when
TFIDF is 2.3979 the recall is
100%. Therefore 2.3979 is used
as the TFIDF threshold
value to extract the keywords:
words whose TFIDF value is greater than
2.3979 are considered keywords.

In the case of Kannada text, when
TFIDF is 3.0910 the recall is
100%. Therefore 3.0910 is used
as the TFIDF threshold value
to extract the keywords:
words whose TFIDF value is greater than
3.0910 are considered keywords.

In the case of Telugu text (Fig 6),
when TFIDF is 3.4095 the
recall is 100%. Therefore
3.4095 is used as the TFIDF
threshold value to extract the
keywords: words whose TFIDF
value is greater than 3.4095
are considered keywords.

Phase 3: Keyword extraction from Telugu Language
In the case of Telugu text (Fig 6),
when TFIDF is 3.4095 the recall is
100%. Therefore 3.4095 is used as
the TFIDF threshold value
to extract the keywords: words whose
TFIDF value is greater than 3.4095 are
considered keywords.

The third phase of the work is keyword extraction. TF*IDF is used to evaluate
how important a word is in a document, and a TF*IDF threshold is used to
select the important keywords. In the case of Kannada text, when TFIDF is
3.0910 the recall is 100%. Therefore 3.0910 is used as the TFIDF threshold
value to extract the keywords. In the case of Tamil text, when TFIDF is 2.3979
the recall is 100%. Therefore 2.3979 is used as the TFIDF threshold value to
extract the keywords. In the case of Telugu text, when TFIDF is 3.4095 the
recall is 100%. Therefore 3.4095 is used as the TFIDF threshold value to extract
the keywords.