TEXT MINING
BY
THEJESWINI
B.Tech CSE 3 Year
SUBCODE:XCSE65
SUBNAME:DATA MINING
CONTENTS
INTRODUCTION
DATA MINING vs TEXT MINING
AREAS OF TEXT MINING
INFORMATION RETRIEVAL
TEXT MINING PROCESS
TEXT MINING APPROACHES
CHALLENGES OF TEXT MINING
REFERENCES
INTRODUCTION
• Text databases are growing rapidly because many sources now generate data
as text.
• These sources include collections of documents such as news articles,
research papers, books, digital libraries, e-mail messages, and the World
Wide Web (which can itself be viewed as a huge, interconnected, dynamic text
database). Many government and business institutions also store their data
as text.
• Understanding the patterns in this text and obtaining useful, reliable
information from it is the main motivation for text mining.
INTRODUCTION...(CONTD)
• Text mining is formally defined as the process of extracting relevant
information or patterns from sources that are in unstructured or
semi-structured format.
• Data stored in most text databases is semi-structured, i.e., it is
neither completely unstructured nor completely structured.
• For example, a document may contain a few structured fields, such as title,
authors, publication date, and category, but also some largely unstructured
text components, such as the abstract and contents.
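As a rough illustration of the semi-structured document described above, the sketch below keeps the structured fields and the free-text components side by side. The class name `Document` and all field values are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class Document:
    # Structured fields: fixed names and predictable types.
    title: str
    authors: list
    publication_date: str
    category: str
    # Unstructured components: free text with no fixed schema.
    abstract: str = ""
    contents: str = ""


# All values below are made up for illustration.
doc = Document(
    title="Text Mining Overview",
    authors=["A. Author"],
    publication_date="2020-01-01",
    category="Data Mining",
    abstract="Text mining extracts patterns from text.",
    contents="Text databases are growing rapidly ...",
)

# A structured field can be queried directly.
is_data_mining = doc.category == "Data Mining"
# Unstructured text must be processed first (here, naively split into words).
abstract_words = doc.abstract.lower().split()
```

The structured part supports exact lookups, while the unstructured part is what text mining techniques are applied to.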
DATA MINING vs TEXT MINING
DATA MINING | TEXT MINING
The process of finding patterns and extracting useful data from large data sets. | The same process, applied to data from various text documents.
Applied on all types of data. | Applied on text data, which is mostly semi-structured or unstructured.
Processing of data is done directly. | Processing of data is done linguistically.
Statistical techniques are used to evaluate data. | Computational linguistic principles are used to evaluate text.
AREAS OF TEXT MINING
IR (Information Retrieval): Query-based search over large collections of text
documents.
NLP (Natural Language Processing): NLP applications are difficult to develop
because computers generally expect humans to "speak" to them in a programming
language that is precise, unambiguous, and highly structured. Human speech is
rarely that precise: it depends on many complex variables, including slang,
social context, and regional dialects.
IE (Information Extraction): The automatic extraction of structured data, such
as entities, relationships between entities, and attributes describing
entities, from an unstructured source.
Data Mining: The extraction of useful data and hidden patterns from large data
sets. Data mining tools can predict behaviors and future trends, allowing
businesses to make better data-driven decisions.
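To make information extraction concrete, here is a minimal sketch that pulls two kinds of structured values (e-mail addresses and ISO-style dates) out of free text with regular expressions. The sample text and both patterns are assumptions made for illustration; real IE systems handle far more entity types and typically rely on NLP rather than hand-written patterns.

```python
import re

# Hypothetical unstructured input.
text = (
    "Contact Jane Doe at jane.doe@example.com before 2024-03-15. "
    "The report was filed on 2023-11-02 by john@example.org."
)

# Simplified patterns, not a complete grammar for e-mails or dates.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

# Structured record extracted from the unstructured source.
record = {
    "emails": EMAIL.findall(text),
    "dates": DATE.findall(text),
}
```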
INFORMATION RETRIEVAL
Information retrieval is a method to retrieve information from a large number
of text-based documents.
Because text is so abundant, information retrieval has found many
applications. Information retrieval systems include:
-on-line library catalog systems,
-on-line document management systems, and
-the more recently developed Web search engines.
• A typical information retrieval problem is to locate relevant documents in a
document collection based on a user’s query, which is often a set of keywords
describing an information need.
INFORMATION RETRIEVAL…(CONTD)
1. BASIC MEASURES OF INFORMATION RETRIEVAL
There are two basic measures for assessing the quality of text retrieval:
Precision: This is the percentage of retrieved documents that are in fact
relevant to the query (i.e., “correct” responses). It is formally defined as
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall: This is the percentage of documents that are relevant to the query and
were, in fact, retrieved. It is formally defined as
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
One commonly used trade-off is the F-score, which is defined as the harmonic
mean of recall and precision:
F-score = (recall × precision) / ((recall + precision) / 2)
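The three measures can be computed directly from the sets of relevant and retrieved documents. The sketch below is illustrative only: the function name `precision_recall_f` and the document identifiers are hypothetical.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two sets of document ids."""
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        f_score = 0.0  # avoid dividing by zero when nothing relevant was retrieved
    else:
        # Harmonic mean of recall and precision.
        f_score = (recall * precision) / ((recall + precision) / 2)
    return precision, recall, f_score


# Made-up example: 4 relevant documents, 3 retrieved, 2 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
p, r, f = precision_recall_f(relevant, retrieved)
```

Here precision is 2/3, recall is 1/2, and the F-score is 4/7, which lies between the two but closer to the lower value, as expected of a harmonic mean.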
INFORMATION RETRIEVAL…(CONTD)
2. TEXT RETRIEVAL METHODS
Text documents can be retrieved by the following methods:
-Document selection method: In this method, the query specifies constraints
for selecting relevant documents. A typical method of this category is the
“Boolean retrieval model”, in which a document is represented by a set of
keywords and the user provides a Boolean expression of keywords,
e.g. “car and repair shops”, “tea or coffee”.
-Document ranking method: In this method, the query is used to rank all
documents in order of relevance. The goal is to approximate the degree of
relevance of a document with a score computed from information such as the
frequency of words in the document and in the whole collection.
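Both retrieval methods can be sketched on a toy collection. The documents below are made up, the Boolean query is read as a plain conjunction of its keywords, and TF-IDF is used as one common way to combine within-document and collection-wide word frequencies; the slides do not prescribe a specific scoring formula, so treat that choice as an assumption.

```python
import math
from collections import Counter

# Hypothetical toy collection.
docs = {
    "d1": "car repair shops in the city",
    "d2": "tea and coffee shops",
    "d3": "car rental and car wash",
}
tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}


def having(term):
    """Set of documents whose keyword list contains the term."""
    return {doc_id for doc_id, words in tokens.items() if term in words}


# Document selection (Boolean model): AND is set intersection, OR is union.
car_and_repair_shops = having("car") & having("repair") & having("shops")
tea_or_coffee = having("tea") | having("coffee")


def tf_idf_score(query, doc_id):
    """Sum of term frequency times inverse document frequency over query terms."""
    counts = Counter(tokens[doc_id])
    score = 0.0
    for term in query.lower().split():
        df = len(having(term))  # number of documents containing the term
        if df == 0:
            continue  # term absent from the collection
        score += counts[term] * math.log(len(docs) / df)
    return score


def rank(query):
    """Document ranking: all documents, most relevant first."""
    return sorted(docs, key=lambda d: tf_idf_score(query, d), reverse=True)


ranking = rank("car repair")
```

Selection returns an unordered set that either satisfies the constraints or does not, whereas ranking orders every document, including those with a score of zero.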
INFORMATION RETRIEVAL…(CONTD)
• The first step in most retrieval systems is to identify keywords for
representing documents, a preprocessing step often called tokenization. To
avoid indexing useless words, a text retrieval system often associates a
“stop list” with a set of documents.
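A minimal sketch of tokenization with a stop list follows. The stop list here is a small hypothetical sample; in practice stop lists are larger and may differ from one document set to another.

```python
import re

# Hypothetical stop list: frequent words that carry little meaning for retrieval.
STOP_LIST = {"a", "an", "the", "of", "for", "with", "and", "in", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(text):
    """Tokens that remain after removing stop-list words."""
    return [tok for tok in tokenize(text) if tok not in STOP_LIST]


terms = index_terms("The first step in most retrieval systems is to identify keywords.")
```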
Text mining is a part of data mining.
TEXT MINING PROCESS
• Text preprocessing
-Syntactic/semantic text analysis (text cleanup, tokenization)
• Feature generation
-Bag of words (the words a document contains and their occurrences)
-Vector space
• Feature selection
-Simple counting
-Statistics
• Text/data mining
-Classification (supervised)
-Clustering (unsupervised)
-Associations (relationships)
• Analyzing results
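The first three stages above can be sketched end to end as follows. The corpus is made up, and the selection rule (keep words that occur in at least two documents) is just one example of "simple counting". The mining step itself is left out of this sketch.

```python
import re
from collections import Counter

# Hypothetical corpus.
corpus = [
    "Text mining extracts patterns from text.",
    "Data mining extracts patterns from large data sets.",
    "Clustering groups similar documents.",
]


def preprocess(text):
    """Text preprocessing: cleanup (lowercase, drop punctuation) and tokenization."""
    return re.findall(r"[a-z]+", text.lower())


# Feature generation: one bag of words (word -> occurrences) per document.
bags = [Counter(preprocess(doc)) for doc in corpus]

# Feature selection by simple counting: document frequency of each word.
doc_freq = Counter(word for bag in bags for word in bag)
vocabulary = sorted(w for w, df in doc_freq.items() if df >= 2)

# Vector space: each document becomes a vector of counts over the vocabulary.
vectors = [[bag[w] for w in vocabulary] for bag in bags]
```

The resulting vectors are what a classification, clustering, or association step would take as input.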
TEXT MINING APPROACHES
• Text mining approaches differ in the inputs the text mining system takes and
in the data mining tasks to be performed. In general, the major approaches,
based on the kinds of data they take as input, are:
(1) the keyword-based approach, where the input is a set of keywords or
terms in the documents,
(2) the tagging approach, where the input is a set of tags, and
(3) the information-extraction approach, where the input is semantic
information, such as events, facts, or entities uncovered by information
extraction.
1) KEYWORD ASSOCIATION-BASED ANALYSIS:
This analysis collects sets of keywords or terms that frequently occur
together and then finds the association or correlation relationships among them.
E.g. [Stanford, University]
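A small sketch of keyword association analysis: count how often keyword pairs occur together, keep the frequent pairs, and measure how strongly one keyword implies the other. The keyword sets and the support threshold of 0.5 are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword sets, one per document.
docs_keywords = [
    {"stanford", "university", "research"},
    {"stanford", "university", "campus"},
    {"university", "admission"},
    {"stanford", "university"},
]

term_counts = Counter()
pair_counts = Counter()
for keywords in docs_keywords:
    term_counts.update(keywords)
    # Sort so that each pair is always stored in the same order.
    pair_counts.update(combinations(sorted(keywords), 2))

n_docs = len(docs_keywords)
min_support = 0.5  # assumed threshold: pair must appear in half of the documents

# Support: fraction of documents that contain both keywords.
frequent_pairs = {
    pair: count / n_docs
    for pair, count in pair_counts.items()
    if count / n_docs >= min_support
}


def confidence(a, b):
    """Fraction of documents containing a that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / term_counts[a]
```

With this data only the pair (stanford, university) is frequent. Every document that mentions "stanford" also mentions "university", but not the other way round.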
2) DOCUMENT CLASSIFICATION ANALYSIS:
Automated document classification is an important text mining task: given the
tremendous number of on-line documents, organizing them into classes by hand is
tedious, yet doing so is essential for document retrieval and subsequent
analysis. E.g. tagging documents with topic labels.
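As one possible illustration of supervised document classification, the sketch below trains a tiny multinomial Naive Bayes classifier with add-one smoothing. Naive Bayes is chosen here only because it is compact; the slides do not name a specific classifier, and the training sentences and labels are made up.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training documents.
training = [
    ("the team won the football match", "sports"),
    ("the player scored a goal", "sports"),
    ("the election results were announced", "politics"),
    ("the minister proposed a new law", "politics"),
]

class_docs = Counter()              # number of documents per class
word_counts = defaultdict(Counter)  # word occurrences per class
vocabulary = set()
for text, label in training:
    words = text.split()
    class_docs[label] += 1
    word_counts[label].update(words)
    vocabulary.update(words)


def classify(text):
    """Return the class with the highest log-probability for the text."""
    scores = {}
    for label in class_docs:
        score = math.log(class_docs[label] / len(training))  # log prior
        total = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)


label = classify("the player won the match")
```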
3) DOCUMENT CLUSTERING ANALYSIS:
Document clustering is one of the most crucial techniques for organizing
documents in an unsupervised manner.
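A minimal sketch of unsupervised document clustering using k-means on word-count vectors. K-means is an assumption, since the slides do not name an algorithm. The documents are made up, and the initial centroids are fixed by hand so the example is reproducible; k-means normally starts from randomly chosen centroids.

```python
from collections import Counter

# Hypothetical documents on two topics, with no labels.
docs = [
    "stock market shares rise",
    "market shares fall",
    "football team wins match",
    "team loses football match",
]
bags = [Counter(d.split()) for d in docs]
vocab = sorted({w for bag in bags for w in bag})
vectors = [[bag[w] for w in vocab] for bag in bags]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(vectors, seeds, max_iterations=10):
    """Cluster vectors, starting from the documents at the seed indices."""
    centroids = [list(vectors[i]) for i in seeds]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each document joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = mean(members)
    return labels


labels = k_means(vectors, seeds=[0, 2])
```

No class labels are supplied: the two groups emerge only from which words the documents share.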
CHALLENGES OF TEXT MINING
• Information is in unstructured textual form.
• Textual databases are large, which makes text mining difficult to apply.
• Relationships between concepts in text are complex and subtle.
• Word ambiguity and context sensitivity
e.g. "windows" can mean either an operating system or openings in a wall
that let air flow into a house.
• Noisy data
Spelling mistakes and irrelevant data (outliers).
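The word-ambiguity challenge above can be illustrated with a much-simplified, overlap-based disambiguation: pick the sense whose cue words share the most terms with the surrounding sentence. The cue-word sets are hypothetical and hand-picked for this example; real systems derive such context from dictionaries or corpora.

```python
# Hypothetical cue words for the two senses of "windows".
SENSES = {
    "operating system": {"computer", "software", "install", "microsoft", "update", "pc"},
    "opening in a wall": {"house", "wall", "air", "glass", "open", "room", "curtains"},
}


def disambiguate(sentence):
    """Return the sense whose cue words overlap most with the sentence."""
    context = set(sentence.lower().replace(".", "").split())
    overlaps = {sense: len(context & cues) for sense, cues in SENSES.items()}
    return max(overlaps, key=overlaps.get)


sense_1 = disambiguate("Open the windows to let fresh air into the room.")
sense_2 = disambiguate("Install the latest Windows update on your computer.")
```

The same surface word resolves to different senses depending on context, which is why keyword matching alone can mislead a text mining system.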
REFERENCES
[1] Jiawei Han (University of Illinois at Urbana-Champaign) and Micheline Kamber,
“Data Mining: Concepts and Techniques”, Second Edition.
[2] https://www.javatpoint.com/text-data-mining
[3] https://paginas.fe.up.pt/~ec/files_0405/slides/07%20TextMining.pdf