India is a multilingual nation, and text mining is a growing research area within data mining. The aim of this work is therefore to conduct a detailed study of text mining for Indian languages. The objectives of the research work are as follows:
1. To design a method for representing Indian-language documents.
2. To propose an algorithm to categorize documents based on language and domain.
3. To design a language-independent algorithm to extract the keywords from all Indian-language documents.
Third DC Meeting
Kannada
ನಮಸ್ಕಾರ, ಶುಭ ಮುಂಜಾನೆ
ಶುಭ ಮಧ್ಯಾಹ್ನ, ಶುಭ ರಾತ್ರಿ
ಶುಭ ಹಾರೈಕೆ (ಗುಡ್ ಬೈ)
ಧನ್ಯವಾದಗಳು
namaskAra, shubha muMjAne
shubha madhyAhna, shubha rAtri
shubha hAraike (guD bai)
dhanyavAdagaLu

Tamil
வணக்கம், (காலை) வணக்கம், (மதிய) வணக்கம்
நல்லிரவாக அமையட்டும்
சென்று வருகிறேன்
நன்றி
vaNakkam, (kAlai) vaNakkam, (matiya) vaNakkam
~nalliravAka amaiyaTTum
cenRu varukiREn
~nanRi

Telugu
హలో, నమస్కారం
నమస్కారం, నమస్తే
నమస్తే, కృతజ్ఞతలు
halO, namaskAraM
namaskAraM, namastE
namastE, kRutaj~jatalu
The research work is divided into 3 phases:
1. Document representation: Vector Space Model; properties of the corpus
2. Document categorization: based on the language; language-independent classifier
3. Keyword extraction: using TF, IDF, and TF*IDF
In a text mining study, a document is the basic unit of analysis. To analyze a document, the first step is data preprocessing, which converts unstructured data into structured data.
Corpus -> Standardization -> Tokenization -> Remove Stop Words -> Stemming -> Vector Space Model
After preprocessing, data mining algorithms can be applied.
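The first stages of this preprocessing pipeline could be sketched in Python roughly as follows. This is a minimal illustration, not the method used in the study: the function names, the choice of NFC normalization for the standardization step, the punctuation set, and the sample stop-word list are all assumptions. Stemming and the vector space model are left out here.

```python
import re
import unicodedata

PUNCTUATION = ".,;:!?()[]{}\"'"  # assumed punctuation set, for illustration only


def standardize(text):
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation from each token."""
    tokens = (tok.strip(PUNCTUATION) for tok in text.split())
    return [tok for tok in tokens if tok]


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the supplied stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]


def preprocess(text, stop_words):
    """Standardization -> tokenization -> stop-word removal."""
    return remove_stop_words(tokenize(standardize(text)), stop_words)


# Hypothetical usage on transliterated text with a made-up stop-word set.
print(preprocess("namaskAra,  shubha muMjAne", {"shubha"}))
```

The sketch splits on whitespace rather than matching `\w+`, because in Python's `re` module `\w` does not match some combining marks (for example the virama) that occur inside words in Indic scripts, so a `\w+` tokenizer may break such words apart.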
1. What is text mining?
* Strict definition: the nontrivial extraction of implicit, previously unknown, and potentially useful information from [textual] data.
* Loose definition: the science of extracting useful information from large [textual] data sets.
* Text mining = information retrieval + statistics + artificial intelligence (natural language processing, machine learning / pattern recognition)
2. What are the data sources for text mining?
* World Wide Web
3. What is the need for Indian-language text mining?
* The Constitution of India allows each Indian state to choose its own official language for official communication at the state level. The amount of textual data available in electronic form in the various Indian regional languages is growing rapidly. Indian-language text mining is therefore required.
Stemming is the process of reducing derived words to their root form.
Example:
"cats", "catlike", "catty" -> "cat"
"stemmer", "stemming", "stemmed" -> "stem"
"fishing", "fished", "fisher" -> "fish"
"argue", "argued", "argues", "arguing", "argus" -> "argu"
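The idea of stemming by suffix stripping can be illustrated with a toy Python function. This is only a sketch under stated assumptions: the suffix list is a small hand-picked English one, the name `toy_stem` is hypothetical, and it is not the Porter algorithm or the stemmer used in this work, so it does not reproduce every example on the slide (for instance "catlike" or "argus"). A stemmer for Kannada, Tamil, or Telugu would need its own suffix rules.

```python
# (suffix, undo_doubling) pairs, tried in order; English-only and illustrative.
SUFFIXES = (("ing", True), ("ed", True), ("er", True), ("es", False), ("s", False))


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix that leaves a long-enough stem."""
    for suffix, undo_doubling in SUFFIXES:
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        if len(stem) < min_stem_len:
            continue
        # "stemm" -> "stem": drop one letter of a doubled final consonant,
        # except l, s, z (so that "calling" keeps the stem "call").
        if undo_doubling and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
            stem = stem[:-1]
        return stem
    return word


for w in ("cats", "stemmer", "stemming", "stemmed", "fishing", "fished", "fisher"):
    print(w, "->", toy_stem(w))
```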
      W1  W2  W3  W4  W5
DOC1   1   0   2   1   0
DOC2   2   0   1   0   3
DOC3   0   1   2   2   0
d_ij is the number of times term j appears in document i.
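A term-document matrix of this kind can be built from tokenized documents with a few lines of Python. The sketch below is illustrative: the function name `term_document_matrix` and the placeholder tokens `w1` to `w5` are assumptions, chosen so that the output reproduces the example matrix above.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in doc i."""
    vocabulary = sorted({tok for doc in docs for tok in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Placeholder documents whose term counts match DOC1, DOC2, and DOC3.
docs = [
    ["w1", "w3", "w3", "w4"],
    ["w1", "w1", "w3", "w5", "w5", "w5"],
    ["w2", "w3", "w3", "w4", "w4"],
]
vocabulary, matrix = term_document_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```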
Language     Kannada  Tamil   Telugu
Docs         100      100     100
Tokens       26315    20360   18427
Vocabulary   20417    15941   14652
Zipf's law describes the frequency behavior of words across an entire corpus.
In natural language, there
are a few frequent terms
and many rare terms.
Frequency * rank = constant, so the frequency of a word is inversely proportional to its rank.
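One way to check this relationship on a corpus is to rank the terms by frequency and look at the product of rank and frequency. The Python sketch below is illustrative only: `rank_frequency` is a hypothetical helper, and the token list is synthetic data constructed so that the product is exactly constant. On a real corpus the product is only roughly constant.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, term, frequency, rank * frequency) rows, most frequent first."""
    ranked = Counter(tokens).most_common()
    return [(rank, term, freq, rank * freq)
            for rank, (term, freq) in enumerate(ranked, start=1)]


# Synthetic tokens: frequencies 6, 3, 2 at ranks 1, 2, 3.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for row in rank_frequency(tokens):
    print(row)
```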
Objective: To classify the documents based on language.
Documents -> Classifier -> Tamil language documents / Kannada language documents / Telugu language documents
Algorithm
1. Identify the files for each specific language.
2. Associate a language label with each file.
3. Build a corpus C.
4. Preprocess the corpus C.
5. Apply a stemming algorithm to reduce all words to their root form.
6. Generate the VSM (term-document matrix) using binary term occurrence: D(i, j) is 1 if the j-th term occurs in the i-th document and 0 otherwise.
7. Train the classifiers (kNN, J48, and NB) using C as the training examples.
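Steps 6 and 7 can be sketched for the kNN case in plain Python. This is a minimal illustration under assumptions, not the setup used in the study: the Hamming distance, the value of k, the function names, and the tiny transliterated training set are all made up for the example, and J48 and Naive Bayes are not shown.

```python
from collections import Counter


def binary_vector(tokens, vocabulary):
    """Binary term occurrence: 1 if the term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training vectors."""
    # Hamming distance suits 0/1 vectors; the distance choice is an assumption.
    distances = sorted(
        (sum(a != b for a, b in zip(vec, query)), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus of transliterated tokens with language labels.
vocabulary = ["kAlai", "matiya", "namaskAra", "rAtri", "shubha", "vaNakkam"]
train_docs = [
    (["namaskAra", "shubha"], "kannada"),
    (["shubha", "rAtri"], "kannada"),
    (["vaNakkam", "kAlai"], "tamil"),
    (["vaNakkam", "matiya"], "tamil"),
]
train_vectors = [binary_vector(doc, vocabulary) for doc, _ in train_docs]
train_labels = [label for _, label in train_docs]
print(knn_predict(train_vectors, train_labels, binary_vector(["vaNakkam"], vocabulary)))
```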
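Confusion Matrix (rows: actual language, columns: predicted language)

kNN Classifier
          Kannada  Tamil  Telugu
Kannada   87       2      11
Tamil     2        96     2
Telugu    4        0      96

J48 Classifier
          Kannada  Tamil  Telugu
Kannada   99       1      0
Tamil     1        97     2
Telugu    4        0      96

NB Classifier
          Kannada  Tamil  Telugu
Kannada   100      0      0
Tamil     2        98     0
Telugu    5        0      95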
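Accuracy follows from a confusion matrix as the sum of the diagonal divided by the sum of all cells. The short Python sketch below applies this to the three matrices above; the function name `accuracy` and the variable names are illustrative.

```python
def accuracy(confusion):
    """Percentage of documents on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total


# Values copied from the confusion matrices for the three classifiers.
KNN = [[87, 2, 11], [2, 96, 2], [4, 0, 96]]
J48 = [[99, 1, 0], [1, 97, 2], [4, 0, 96]]
NB = [[100, 0, 0], [2, 98, 0], [5, 0, 95]]

for name, matrix in (("kNN", KNN), ("J48", J48), ("NB", NB)):
    print(name, round(accuracy(matrix), 2))
```

The results are 279/300 = 93% for kNN, 292/300 = 97.33% for J48, and 293/300 = 97.666...% for Naive Bayes, which rounds to 97.67%.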
Data mining algorithms used for English text categorization can similarly be applied to Indian-language text categorization.
Effectiveness of the classification algorithms:
kNN gives 93% accuracy.
Decision tree (J48) gives 97.33% accuracy.
Naïve Bayes gives 97.67% accuracy.
Naïve Bayes is the most effective of the three algorithms for Indian-language text categorization.
The objective of this work is to design a language-independent classifier to categorize documents by domain.
Documents -> Language-Independent Classifier -> Cinema / Sports / Politics
Case    Domain    No. of documents  No. of tokens  Vocabulary  Vocabulary after removing stop words
Case 1  Cinema    5                 1378           978         943
Case 1  Politics  5                 831            560         537
Case 1  Sports    5                 712            426         398
Case 2  Cinema    48                13463          5302        5107
Case 2  Sports    48                24096          8156        7934
The predictions of the classification models for case 1 are tabulated in the form of a confusion matrix.
kNN accuracy = (0+0+4) / (0+0+5+0+0+5+1+0+4) = 4/15 = 26.67%
J48 accuracy = (2+3+4) / (2+0+3+0+3+2+0+1+4) = 9/15 = 60.00%
The predictions of the classification models for case 2 are tabulated in the form of a confusion matrix.
kNN accuracy = (38+46) / (38+10+2+46) = 84/96 = 87.5%
J48 accuracy = (48+20) / (48+0+28+20) = 68/96 = 70.83%
In case 1, only five documents from each of three domains (Cinema, Sports, and Politics) were used. Accuracy, measured from the confusion matrix, was 26.67% for kNN and 60.00% for J48.
In case 2, 48 documents from each of two domains (Cinema and Sports) were used. Accuracy, measured from the confusion matrix, was 87.5% for kNN and 70.83% for J48.
Objective:
* Keyword extraction is the task of identifying a small set of words (keywords) from a document that describe the meaning of the document.
* It should be done systematically, without human intervention, using a language-independent model.
Indian-language document -> Data preprocessing -> Candidate keywords -> TF and IDF -> TF*IDF ranking -> Selection of keywords with line/page number
Algorithm
1. Tokenize the Dravidian-language text document.
2. Eliminate stop words and frequent words to obtain the vocabulary words.
3. Store the vocabulary words in a matrix called the vector space model.
4. Calculate the term frequency (TF), inverse document frequency (IDF), and TF*IDF for each word.
5. Select as keywords the vocabulary words whose TF*IDF exceeds a fixed threshold value.
6. Along with each keyword, extract the corresponding line number, paragraph number, or page number.
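Steps 4 and 5 can be sketched in Python as follows. The slides do not state which TF*IDF variant was used, so this sketch assumes raw term counts for TF and the natural logarithm of (number of documents / document frequency) for IDF. The function name `tfidf_keywords`, the toy documents, and the threshold value are hypothetical, and the line or page lookup of step 6 is omitted.

```python
import math
from collections import Counter


def tfidf_keywords(docs, doc_index, threshold):
    """Return {term: score} for terms in one document scoring above threshold."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    tf = Counter(docs[doc_index])  # raw term counts in the chosen document
    scores = {term: count * math.log(n_docs / df[term])
              for term, count in tf.items()}
    return {term: score for term, score in scores.items() if score > threshold}


# Toy tokenized corpus: "b" occurs in every document, so its IDF is 0.
docs = [["a", "a", "b"], ["b", "c"], ["b", "d"]]
print(tfidf_keywords(docs, doc_index=0, threshold=1.0))
```

With this variant, a term that appears in every document scores 0 and can never pass a positive threshold, which is one reason very frequent words tend to drop out of the keyword list.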
Recall = (No. of relevant documents retrieved) / (Total no. of relevant documents in the corpus)
Precision = (No. of relevant documents retrieved) / (Total no. of documents retrieved from the corpus)
Recall = TP / (TP+FN) × 100
Precision = TP / (TP+FP) × 100
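These two measures can be computed from a set of retrieved items and a set of relevant items, as in the Python sketch below. The function name `precision_recall` and the sample sets are illustrative assumptions. The same code applies whether the items are documents or extracted keywords.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as percentages for two collections of items."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives: retrieved and relevant
    fp = len(retrieved - relevant)  # false positives: retrieved but not relevant
    fn = len(relevant - retrieved)  # false negatives: relevant but missed
    precision = 100.0 * tp / (tp + fp) if retrieved else 0.0
    recall = 100.0 * tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical example: everything relevant is retrieved, plus two extra items.
print(precision_recall({"a", "b", "c", "d"}, {"a", "b"}))
```

In this example recall is 100% while precision is only 50%, which shows that a threshold chosen to reach full recall can still admit items that are not relevant.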
In the case of Tamil text, the recall is 100% when TF*IDF is 2.3979. We therefore use 2.3979 as the TF*IDF threshold value and treat words whose TF*IDF value is greater than 2.3979 as keywords.
In the case of Kannada text, the recall is 100% when TF*IDF is 3.0910. We therefore use 3.0910 as the TF*IDF threshold value and treat words whose TF*IDF value is greater than 3.0910 as keywords.
In the case of Telugu text (Fig. 6), the recall is 100% when TF*IDF is 3.4095. We therefore use 3.4095 as the TF*IDF threshold value and treat words whose TF*IDF value is greater than 3.4095 as keywords.
The third phase of the work is keyword extraction. TF*IDF is used to evaluate how important a word is in a document, and a TF*IDF threshold is used to select the important keywords. For Kannada text, the recall is 100% when TF*IDF is 3.0910, so 3.0910 is used as the threshold value. For Tamil text, the recall is 100% when TF*IDF is 2.3979, so 2.3979 is used as the threshold value. For Telugu text, the recall is 100% when TF*IDF is 3.4095, so 3.4095 is used as the threshold value.
33
Third DC Meeting

Mais conteúdo relacionado

Semelhante a Third DC Meeting1.ppt

A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
kevig
 
Script identification from printed document images using statistical
Script identification from printed document images using statisticalScript identification from printed document images using statistical
Script identification from printed document images using statistical
IAEME Publication
 

Semelhante a Third DC Meeting1.ppt (20)

A017420108
A017420108A017420108
A017420108
 
Script Identification of Text Words from a Tri-Lingual Document Using Voting ...
Script Identification of Text Words from a Tri-Lingual Document Using Voting ...Script Identification of Text Words from a Tri-Lingual Document Using Voting ...
Script Identification of Text Words from a Tri-Lingual Document Using Voting ...
 
Applsci 09-02758
Applsci 09-02758Applsci 09-02758
Applsci 09-02758
 
A Context-based Numeral Reading Technique for Text to Speech Systems
A Context-based Numeral Reading Technique for Text to Speech Systems A Context-based Numeral Reading Technique for Text to Speech Systems
A Context-based Numeral Reading Technique for Text to Speech Systems
 
G1803013542
G1803013542G1803013542
G1803013542
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
 
electronics-11-01780-v2.pdf
electronics-11-01780-v2.pdfelectronics-11-01780-v2.pdf
electronics-11-01780-v2.pdf
 
A-STUDY-ON-SENTIMENT-POLARITY.pdf
A-STUDY-ON-SENTIMENT-POLARITY.pdfA-STUDY-ON-SENTIMENT-POLARITY.pdf
A-STUDY-ON-SENTIMENT-POLARITY.pdf
 
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...
 
Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi Language
 
Creation of speech corpus for emotion analysis in Gujarati language and its e...
Creation of speech corpus for emotion analysis in Gujarati language and its e...Creation of speech corpus for emotion analysis in Gujarati language and its e...
Creation of speech corpus for emotion analysis in Gujarati language and its e...
 
Script identification from printed document images using statistical
Script identification from printed document images using statisticalScript identification from printed document images using statistical
Script identification from printed document images using statistical
 
IRJET- Communication Aid for Deaf and Dumb People
IRJET- Communication Aid for Deaf and Dumb PeopleIRJET- Communication Aid for Deaf and Dumb People
IRJET- Communication Aid for Deaf and Dumb People
 
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEMA LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
 
A language independent approach to develop urduir system
A language independent approach to develop urduir systemA language independent approach to develop urduir system
A language independent approach to develop urduir system
 
Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...
 
Keyword Extraction Based Summarization of Categorized Kannada Text Documents
Keyword Extraction Based Summarization of Categorized Kannada Text Documents Keyword Extraction Based Summarization of Categorized Kannada Text Documents
Keyword Extraction Based Summarization of Categorized Kannada Text Documents
 
Identification of monolingual and code-switch information from English-Kannad...
Identification of monolingual and code-switch information from English-Kannad...Identification of monolingual and code-switch information from English-Kannad...
Identification of monolingual and code-switch information from English-Kannad...
 
B tech project_report
B tech project_reportB tech project_report
B tech project_report
 
Summer Research Project (Anusaaraka) Report
Summer Research Project (Anusaaraka) ReportSummer Research Project (Anusaaraka) Report
Summer Research Project (Anusaaraka) Report
 

Último

scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
HenryBriggs2
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
Health
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
jaanualu31
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 

Último (20)

Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planes
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
Work-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptxWork-Permit-Receiver-in-Saudi-Aramco.pptx
Work-Permit-Receiver-in-Saudi-Aramco.pptx
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
Rums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdfRums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdf
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 

Third DC Meeting1.ppt

  • 1. India is multilingual nation. Text mining is a growing research area in data mining. So the aim is conduct a detailed study on text mining on Indian language. The objectives of the research work are as follows 1. To design a method for Indian language documents representation 2. To propose an algorithm to categorize documents based on language and domain. 3. To design a language independent algorithm to extract the keywords from all the Indian language documents. 1 Third DC Meeting
  • 4. Kannada ನಮಸ್ಕಾ ರ, ಶುಭ ಮುಂಜಾನೆ ಶುಭ ಮಧ್ಯಾ ಹ್ನ , ಶುಭ ರಾತ್ರ ಿ ಶುಭ ಹಾರೈಕೆ (ಗುಡ್ ಬೈ) ಧನಾ ವಾದಗಳು namaskAra ,shubha muMjAne shubha madhyAhna, shubha rAtri shubha hAraike (guD bai) dhanyavAdagaLu Tamil வணக்கம், (காலை) வணக்கம், (மதிய) வணக்கம் நை்லிரவாக அலமயட் டும் சென ் று வருகிறேன ் நன ் றி vaNakkam (kAlai) vaNakkam (matiya) vaNakkam ~nalliravAka amaiyaTTum cenRu varukiREn ~nanRi Telugu హలో, నమస్కా రం నమస్కా రం, నమస్తే నమస్తే, కృతజ్ఞతలు halO, namaskAraM namaskAraM , namastE namastE , kRutaj~jatalu 4 Third DC Meeting
  • 5. The research work is divided into 3 phases Documents Representation Vector Space Model Properties of Corpus Document categorization Based on the language Language Independent Classifier Keywords extraction Using TF, IDF, TFIDF 5 Third DC Meeting
  • 6. Data Preprocessing means converting unstructured data into structured data. Corpus Standardization Tokenization Remove Stop Words Stemming Vector Space Model After preprocessing data mining algorithms can be applied. In Text mining study, a document is used as a basic unit of analysis. To analyze the document, the first step is data preprocessing 6 Third DC Meeting
  • 7. Corpus Standardization Tokenization Remove Stop Words Stemming Vector Space Model 1. What is Text Mining? * Strict definition: The nontrivial extraction of implicit, previously unknown, and potentially useful information from [textual] data. * Loose definition: The science of extracting useful information from large [textual] data sets. * Text mining = information retrieval + statistics + artificial intelligence (natural language processing, machine learning / pattern recognition) 2. What are the Data sources for text mining? * World Wide Web 3. What is the need for Indian language text mining? * In the Constitution of India, a provision is made for each of the Indian states to choose their own official language for communicating at the state level for official purpose. The availability of constantly increasing amount of textual data of various Indian regional languages in electronic form has accelerated. Therefore Indian language text mining is required. 7 Third DC Meeting
  • 8. Corpus Standardization Tokenization Remove Stop Words Stemming Vector Space Model 1. What is Text Mining? * Strict definition: The nontrivial extraction of implicit, previously unknown, and potentially useful information from [textual] data. * Loose definition: The science of extracting useful information from large [textual] data sets. * Text mining = information retrieval + statistics + artificial intelligence (natural language processing, machine learning / pattern recognition) 2. What are the Data sources for text mining? * World Wide Web 3. What is the need for Indian language text mining? * In the Constitution of India, a provision is made for each of the Indian states to choose their own official language for communicating at the state level for official purpose. The availability of constantly increasing amount of textual data of various Indian regional languages in electronic form has accelerated. Therefore Indian language text mining is required. 8 Third DC Meeting
  • 9. Corpus Standardization Tokenization Remove Stop Words Stemming Vector Space Model 1. What is Text Mining? * Strict definition: The nontrivial extraction of implicit, previously unknown, and potentially useful information from [textual] data. * Loose definition: The science of extracting useful information from large [textual] data sets. * Text mining = information retrieval + statistics + artificial intelligence (natural language processing, machine learning / pattern recognition) 2. What are the Data sources for text mining? * World Wide Web 3. What is the need for Indian language text mining? * In the Constitution of India, a provision is made for each of the Indian states to choose their own official language for communicating at the state level for official purpose. The availability of constantly increasing amount of textual data of various Indian regional languages in electronic form has accelerated. Therefore Indian language text mining is required. 9 Third DC Meeting
  • 10. Corpus Standardization Tokenization Remove Stop Words Stemming Vector Space Model 1. What is Text Mining? * Strict definition: The nontrivial extraction of implicit, previously unknown, and potentially useful information from [textual] data. * Loose definition: The science of extracting useful information from large [textual] data sets. * Text mining = information retrieval + statistics + artificial intelligence (natural language processing, machine learning / pattern recognition) 2. What are the Data sources for text mining? * World Wide Web 3. What is the need for Indian language text mining? * In the Constitution of India, a provision is made for each of the Indian states to choose their own official language for communicating at the state level for official purpose. The availability of constantly increasing amount of textual data of various Indian regional languages in electronic form has accelerated. Therefore Indian language text mining is required. 10 Third DC Meeting
  • 11. Corpus Standardization Tokenization Remove Stop Words Stemming Vector Space Model 1. What is Text Mining? * Strict definition: The nontrivial extraction of implicit, previously unknown, and potentially useful information from [textual] data. * Loose definition: The science of extracting useful information from large [textual] data sets. * Text mining = information retrieval + statistics + artificial intelligence (natural language processing, machine learning / pattern recognition) 2. What are the Data sources for text mining? * World Wide Web 3. What is the need for Indian language text mining? * In the Constitution of India, a provision is made for each of the Indian states to choose their own official language for communicating at the state level for official purpose. The availability of constantly increasing amount of textual data of various Indian regional languages in electronic form has accelerated. Therefore Indian language text mining is required. 11 Third DC Meeting
  • 12. Corpus Standardization Tokenization Remove Stop Words Stemming Vector Space Model Stemming is the term used to describe the process for reducing derived words to their root form Example: "cats“, "catlike", "catty“ -> "cat",  "stemmer", "stemming", "stemmed" -> "stem“ "fishing", "fished", and "fisher" -> "fish“ "argue", "argued", "argues“, "arguing", "argus" ->"argu” 12 Third DC Meeting
  • 13. Corpus Standardization Tokenization Remove Stop Words Stemming Vector Space Model W1 W2 W3 W4 W5 DOC1 1 0 2 1 0 DOC2 2 0 1 0 3 DOC3 0 1 2 2 0 dij represents number of times that term appears in the document 13 Third DC Meeting
  • 14. Properties of the corpus:
      Language   Docs  Tokens  Vocabulary
      Kannada     100   26315       20417
      Tamil       100   20360       15941
      Telugu      100   18427       14652
      Zipf's law describes the behavior of words across an entire corpus.
  • 15. In natural language, there are a few frequent terms and many rare terms.
  • 16. Zipf's law: frequency × rank ≈ constant, so the frequency of a word is inversely proportional to its rank.
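One way to check the frequency × rank relation on a corpus is sketched below. The function name `rank_frequency` and the toy token list are assumptions; the toy counts were chosen to be exactly Zipfian, whereas a real corpus only follows the relation approximately.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, term, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, term, freq, rank * freq)
        for rank, (term, freq) in enumerate(ranked, start=1)
    ]


# Toy corpus (hypothetical): frequencies 6, 3, 2 at ranks 1, 2, 3.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for row in rank_frequency(tokens):
    print(row)
```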
  • 17. Objective: to classify documents by language. Pipeline: Documents -> Classifier -> Tamil, Kannada, or Telugu language documents.
  • 18. Algorithm
      1. Identify the files for each specific language.
      2. Associate a language label with each file.
      3. Build a corpus C.
      4. Preprocess the corpus C.
      5. Apply a stemming algorithm to reduce all words to their root form.
      6. Generate the VSM (term-document matrix) D using binary term occurrence: D(i, j) = 1 if term j occurs in document i, and 0 otherwise.
      7. Train the classifiers (kNN, J48, and NB) using C as training examples.
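The labelling, binary term occurrence, and training steps can be sketched in plain Python for one of the three classifiers. This is a hedged illustration, not the tooling used in the research work: it assumes a Bernoulli Naive Bayes with add-one smoothing, whitespace tokenization, no stemming, and a handful of romanized toy phrases as training data. The function names `train_bernoulli_nb` and `predict` are hypothetical.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(labelled_docs):
    """Train on (text, label) pairs using binary term occurrence."""
    doc_terms = [(set(text.split()), label) for text, label in labelled_docs]
    vocabulary = sorted(set().union(*(terms for terms, _ in doc_terms)))
    docs_by_class = defaultdict(list)
    for terms, label in doc_terms:
        docs_by_class[label].append(terms)
    total = len(doc_terms)
    classes = {}
    for label, docs in docs_by_class.items():
        n = len(docs)
        # P(term present | class) with add-one (Laplace) smoothing.
        cond = {t: (sum(t in d for d in docs) + 1) / (n + 2) for t in vocabulary}
        classes[label] = (math.log(n / total), cond)
    return {"vocabulary": vocabulary, "classes": classes}


def predict(model, text):
    """Return the label with the highest log-posterior score."""
    terms = set(text.split())
    best_label, best_score = None, -math.inf
    for label, (log_prior, cond) in model["classes"].items():
        score = log_prior
        for t in model["vocabulary"]:
            p = cond[t]
            score += math.log(p if t in terms else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training corpus (an assumption): romanized greetings, two per language.
training = [
    ("namaskAra shubha muMjAne", "kannada"),
    ("shubha rAtri dhanyavAdagaLu", "kannada"),
    ("vaNakkam kAlai vaNakkam", "tamil"),
    ("nanRi vaNakkam", "tamil"),
    ("halO namaskAraM", "telugu"),
    ("namastE kRutajjatalu", "telugu"),
]
model = train_bernoulli_nb(training)
print(predict(model, "shubha madhyAhna"))
print(predict(model, "vaNakkam"))
```

Terms that never occurred in training, such as "madhyAhna" here, are simply ignored because they are not in the vocabulary.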
  • 19. Confusion matrix (rows: actual language, 100 documents each; columns: predicted language)
                 kNN classifier            J48 classifier            NB classifier
                 Kannada  Tamil  Telugu    Kannada  Tamil  Telugu    Kannada  Tamil  Telugu
      Kannada         87      2      11         99      1       0        100      0       0
      Tamil            2     96       2          1     97       2          2     98       0
      Telugu           4      0      96          4      0      96          5      0      95
  • 20. Data mining algorithms used for English text categorization can likewise be applied to Indian language text categorization. Classification accuracy: kNN 93%, decision tree (J48) 97.33%, Naïve Bayes 97.67%. Naïve Bayes is the most accurate of the three algorithms for Indian language text categorization.
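The accuracy figures follow from the confusion matrices: correct predictions (the diagonal) divided by all predictions. A small sketch, with the matrix values read row by row from the confusion-matrix slide; the function name `accuracy` is an assumption.

```python
def accuracy(confusion):
    """Percentage of documents on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total


# Rows and columns in the order Kannada, Tamil, Telugu.
knn = [[87, 2, 11], [2, 96, 2], [4, 0, 96]]
j48 = [[99, 1, 0], [1, 97, 2], [4, 0, 96]]
nb = [[100, 0, 0], [2, 98, 0], [5, 0, 95]]

for name, matrix in [("kNN", knn), ("J48", j48), ("NB", nb)]:
    print(f"{name}: {accuracy(matrix):.2f}%")
```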
  • 21. Objective: to design a language-independent classifier that categorizes documents by domain. Pipeline: Documents -> Language-independent classifier -> Cinema, Sports, or Politics.
  • 22. Data sets:
      Case    Domain    Documents  Tokens  Vocabulary  Vocabulary after stop-word removal
      Case 1  Cinema            5    1378         978                                 943
              Politics          5     831         560                                 537
              Sports            5     712         426                                 398
      Case 2  Cinema           48   13463        5302                                5107
              Sports           48   24096        8156                                7934
  • 23. The predictions of the classification models for case 1 are tabulated as a confusion matrix.
      kNN accuracy = (0+0+4) / (0+0+5+0+0+5+1+0+4) = 26.67%
      J48 accuracy = (2+3+4) / (2+0+3+0+3+2+0+1+4) = 60.00%
  • 24. The predictions of the classification models for case 2 are tabulated as a confusion matrix.
      kNN accuracy = (38+46) / (38+10+2+46) = 87.5%
      J48 accuracy = (48+20) / (48+0+28+20) = 70.83%
  • 25. A confusion matrix is used to measure the accuracy of each classification algorithm.
      Case 1: five documents from each of three domains (Cinema, Sports, Politics).
        kNN accuracy = (0+0+4) / (0+0+5+0+0+5+1+0+4) = 26.67%
        J48 accuracy = (2+3+4) / (2+0+3+0+3+2+0+1+4) = 60.00%
      Case 2: 48 documents from each of two domains (Cinema, Sports).
        kNN accuracy = (38+46) / (38+10+2+46) = 87.5%
        J48 accuracy = (48+20) / (48+0+28+20) = 70.83%
  • 26. Objective: keyword extraction is the task of identifying a small set of words (keywords) in a document that describe the meaning of the document. It should be done systematically, without human intervention, and the model should be language independent. Diagram labels: Indian language document; data preprocessing; candidate keywords; TF; IDF; ranking; selection of keywords with line/page no.
  • 27. Algorithm
      1. Tokenize the Dravidian language text document.
      2. Eliminate stop words and frequent words to obtain the vocabulary words.
      3. Store the vocabulary words in a matrix called the vector space model.
      4. Calculate the term frequency (TF), inverse document frequency (IDF), and TF*IDF for each word.
      5. Select the vocabulary words whose TF*IDF exceeds a fixed threshold value.
      6. Along with each keyword, extract the corresponding line number, paragraph number, or page number.
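The steps above can be sketched as follows. The slides do not state the exact TF and IDF formulas, so this sketch assumes raw term counts for TF and the natural logarithm of N / df for IDF. The function name `extract_keywords`, the English toy documents, the stop-word set, and the threshold of 1.5 are all illustrative assumptions.

```python
import math
from collections import Counter


def extract_keywords(documents, target_index, stop_words, threshold):
    """Return {keyword: (tf_idf, line_numbers)} for one document of a corpus."""
    # Tokenize on whitespace and drop stop words.
    tokenized = [
        [t for t in doc.split() if t not in stop_words] for doc in documents
    ]
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Term frequency in the target document (raw counts, an assumption).
    tf = Counter(tokenized[target_index])
    lines = documents[target_index].splitlines()
    keywords = {}
    for term, count in tf.items():
        score = count * math.log(n_docs / df[term])
        if score > threshold:
            # 1-based numbers of the lines on which the keyword occurs.
            line_nos = [
                i for i, line in enumerate(lines, start=1) if term in line.split()
            ]
            keywords[term] = (score, line_nos)
    return keywords


# Toy corpus (hypothetical), one string per document, lines separated by "\n".
documents = [
    "the cricket match today\nthe cricket team wins",
    "the film release today",
    "the election results today",
]
result = extract_keywords(documents, 0, stop_words={"the"}, threshold=1.5)
for term, (score, line_nos) in result.items():
    print(term, round(score, 4), line_nos)
```

A word that occurs in every document, such as "today" here, gets an IDF of zero and can never pass a positive threshold.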
  • 28. Recall and precision:
      Recall = (no. of relevant documents retrieved) / (total no. of relevant documents in the corpus)
      Recall = TP / (TP+FN) × 100
      Precision = (no. of relevant documents retrieved) / (total no. of documents retrieved from the corpus)
      Precision = TP / (TP+FP) × 100
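Both measures can be computed from two sets: what was retrieved and what is actually relevant. A minimal sketch; the function name `precision_recall` and the example keyword sets are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as percentages for two collections of items."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant items that were retrieved
    fp = len(retrieved - relevant)   # retrieved items that are not relevant
    fn = len(relevant - retrieved)   # relevant items that were missed
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical example: three extracted keywords, two of them relevant.
precision, recall = precision_recall(
    retrieved={"cricket", "team", "today"},
    relevant={"cricket", "team"},
)
print(f"precision = {precision:.2f}%, recall = {recall:.2f}%")
```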
  • 29. For Tamil text, recall is 100% when TF*IDF is 2.3979. We therefore use 2.3979 as the TF*IDF threshold, and treat words whose TF*IDF value is greater than 2.3979 as keywords.
  • 30. For Kannada text, recall is 100% when TF*IDF is 3.0910. We therefore use 3.0910 as the TF*IDF threshold, and treat words whose TF*IDF value is greater than 3.0910 as keywords.
  • 31. For Telugu text (Fig. 6), recall is 100% when TF*IDF is 3.4095. We therefore use 3.4095 as the TF*IDF threshold, and treat words whose TF*IDF value is greater than 3.4095 as keywords.
  • 33. The third phase of the work is keyword extraction. TF*IDF measures how important a word is in a document, and a TF*IDF threshold is used to select the important keywords. Recall is 100% at the following TF*IDF values, which are therefore used as the thresholds for keyword extraction:
      Kannada: 3.0910
      Tamil: 2.3979
      Telugu: 3.4095