Submitted by,
Gokul K
LE48MCA15
No:28
FISAT
 Defining Text Mining
 Structured vs. Unstructured Data
 Why Text Mining
 Some Text Mining Ambiguities
 Text Mining Practice Areas
 Pre-processing Techniques
 Challenges in Text Mining
 Conclusion
• The use of computational methods and techniques to
extract high-quality information from text
• The discovery by computer of new, previously unknown
information, by automatically extracting information from a
usually large amount of different unstructured textual
resources
 We have a collection of documents (mainly text or
html-based)
 We have a set of users
 A user wants to retrieve the documents related to
a given concept
 The user then submits a query expressed
as words or terms
 An information retrieval system returns the
documents most related to this concept
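
The retrieval flow above can be sketched as a minimal term-overlap ranker. The function name `rank_documents`, the sample documents, and the scoring rule (the number of distinct query terms found in each document) are illustrative assumptions, not a method given in the slides; practical systems usually rely on weighted schemes such as TF-IDF.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_documents(query, documents):
    """Return (doc_index, score) pairs sorted by descending score.

    The score is the number of distinct query terms that occur in the
    document. Documents that share no term with the query are dropped.
    """
    query_terms = set(tokenize(query))
    scored = []
    for index, doc in enumerate(documents):
        overlap = query_terms & set(tokenize(doc))
        if overlap:
            scored.append((index, len(overlap)))
    # Highest score first; ties keep the original document order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored

# Hypothetical three-document collection.
documents = [
    "Text mining extracts information from unstructured text.",
    "A bear is a large animal.",
    "Information retrieval returns documents related to a query.",
]
results = rank_documents("information retrieval", documents)
```

Here the third document ranks first because it contains both query terms, the first document follows with one match, and the second is not returned at all.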
 Unstructured text is present in various forms, and
in huge and ever-increasing quantities:
1. books
2. financial and other business reports
3. various kinds of business and
administrative documents
4. news articles
 It is estimated that ~80% of all the available data are
unstructured data
 TM research and practice are focused on the
development, continual improvement and
application of such methods
 To enable effective and efficient use of such huge
quantities of textual content, we need
computational methods for
1. automated extraction of information from
unstructured text
2. analysis and summarization of extracted
information
 Language is ambiguous
 Context is needed to clarify
 The same words can have different meanings
 Bear (verb) – to support or carry
 Bear (noun) – a large animal
 Different words can mean the same (synonyms)
 Language is subtle (difficult to analyse)
 Concept / word extraction usually results in a huge number of
dimensions
 Thousands of new fields
 Each field typically has low information content (sparse)
 Misspellings, abbreviations, spelling variants
 These render search engines, SQL queries, etc. ineffective.
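
The dimensionality and sparsity problem can be made concrete with a small bag-of-words sketch. The helper names `bag_of_words` and `sparsity` and the three sample documents are assumptions made for illustration; each unique word becomes one column ("field"), and most cells end up as zero.

```python
import re
from collections import Counter

def bag_of_words(documents):
    """Build a sorted vocabulary and a term-count matrix (one row per document)."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for terms that are absent from the document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

def sparsity(matrix):
    """Fraction of cells in the matrix that are zero."""
    cells = [value for row in matrix for value in row]
    return sum(1 for value in cells if value == 0) / len(cells)

# Hypothetical sample documents.
documents = ["the bear can bear the load", "the bank raised rates", "a river bank"]
vocabulary, matrix = bag_of_words(documents)
print(len(vocabulary), "columns; sparsity =", round(sparsity(matrix), 2))
```

Even with three short documents there are nine columns and more than half of the cells are zero; with a real corpus the vocabulary runs to thousands of columns.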
 Homonymy: same word, different meaning
Mary walked along the bank of the river
HarborBank is the richest bank in the city
 Synonymy: different words with similar or the same
meaning; one word can often substitute for the other
without changing the meaning.
Miss Nelson became a kind of big sister to Benjamin
Miss Nelson became a kind of large sister to Benjamin.
 Polysemy: same word or form, but different,
albeit related meaning
The bank raised its interest rates yesterday
The store is next to the newly constructed bank
The bank appeared first in Italy in the Renaissance
 Hyponymy: Concept hierarchy or subclass
Animal (noun) – cat, dog
Injury – broken leg, contusion
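
A concept hierarchy of this kind can be held in a plain mapping from each general concept to its subclasses. The `HIERARCHY` contents and the `hypernym_of` helper below are hypothetical and only mirror the examples on the slide.

```python
# Hypothetical concept hierarchy: each general concept maps to its hyponyms.
HIERARCHY = {
    "animal": ["cat", "dog"],
    "injury": ["broken leg"],
}

def hypernym_of(term, hierarchy=HIERARCHY):
    """Return the parent concept of `term`, or None if the term is not listed."""
    for parent, children in hierarchy.items():
        if term in children:
            return parent
    return None

print(hypernym_of("dog"))
```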
 Search and Information Retrieval – storage and
retrieval of text documents, including search
engines and keyword search
 Document Clustering – Grouping and categorizing
terms, snippets, paragraphs or documents using
clustering methods
 Document Classification – grouping and
categorizing snippets, paragraphs or documents
using data mining classification methods, based on
models trained on labelled examples
 Web Mining – Data and Text mining on the
internet with specific focus on the scale and
interconnectedness of the web
 Information Extraction – Identification and
extraction of relevant facts and relationships from
unstructured text
 Natural Language Processing – Low-level language
processing and understanding tasks (e.g. part-of-speech
tagging)
 Concept extraction – Grouping of words and
phrases into semantically similar groups
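
Of the practice areas listed above, document classification lends itself to a compact sketch. The example below is a minimal multinomial naive Bayes classifier with add-one smoothing; the choice of naive Bayes, the function names, and the four training sentences are assumptions made for illustration, since the slides do not name a specific classification method.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def train(labelled_docs):
    """Count words per class and documents per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled examples.
training = [
    ("the bank raised its interest rates", "finance"),
    ("the bank approved the loan", "finance"),
    ("the bear walked along the river", "nature"),
    ("a large animal lives near the river", "nature"),
]
model = train(training)
print(classify("interest rates at the bank", model))
```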
 Document – a sequence of words and punctuation,
following the grammatical rules of the language.
 Term – usually a word, but can be a word-pair or
phrase
 Corpus – a collection of documents
 Lexicon – set of all unique words in corpus
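
These four terms map directly onto simple data structures. In the sketch below, the two sample documents and the `terms` helper are illustrative assumptions; the corpus is a list of documents and the lexicon is the set of unique terms across it.

```python
import re

# A corpus is a collection of documents (hypothetical samples).
corpus = [
    "Text mining extracts information from text.",
    "A corpus is a collection of documents.",
]

def terms(document):
    """Split one document into lowercase single-word terms."""
    return re.findall(r"[a-z]+", document.lower())

# The lexicon is the set of all unique terms in the corpus.
lexicon = sorted({term for document in corpus for term in terms(document)})
print(lexicon)
```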
 Text Normalization
 Parts of Speech Tagging
 Removal of stop words
 Stop words – common words that don’t add
meaningful content to the document
 Stemming
 Removing suffixes and prefixes, leaving the root or stem of
the word.
 Tokenization
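
Stop-word removal, from the list above, can be sketched as a simple filter. The `STOP_WORDS` set here is a tiny hand-picked assumption; real stop-word lists are much longer and depend on the language and the task.

```python
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words (case-insensitive)."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "the lead paint is unsafe".split()
filtered = remove_stop_words(tokens)
print(filtered)
```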
 Case
 Convert everything to lower case (if you don’t care about proper
nouns, titles, etc.)
 Clean up transcription and typing errors
 do n’t, movei
 Correct misspelled words
 Phonetically
 Use fuzzy matching algorithms such as Soundex,
Metaphone or string edit distance
 Dictionaries
 Use POS and context to make a good guess
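
Spelling correction by string edit distance can be sketched as follows. The `correct` helper and the four-word dictionary are assumptions for illustration; the distance is plain Levenshtein distance, so a transposition such as "movei" for "movie" costs two edits.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def correct(word, dictionary):
    """Return the word itself if known, else the closest dictionary entry."""
    word = word.lower()
    if word in dictionary:
        return word
    # Sorting makes the choice deterministic when several entries tie.
    return min(sorted(dictionary), key=lambda entry: edit_distance(word, entry))

# Hypothetical dictionary of known words.
dictionary = {"movie", "mining", "text", "document"}
print(correct("movei", dictionary))
```

With a larger dictionary, ties and near-misses become common, which is why the slide also suggests using part of speech and context to choose among candidates.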
 Part-of-speech (POS) tagging is the process of assigning
a POS or lexical class marker to each word in a sentence
(and all sentences in a corpus).
 Input: the lead paint is unsafe
 Output: the/Det lead/N paint/N is/V
unsafe/Adj
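
The input/output format above can be reproduced with a toy lookup tagger. The `TAG_LEXICON` table and the default tag are hypothetical; a lookup table assigns one fixed tag per word, so it cannot tell "lead" the noun from "lead" the verb, which is exactly why practical taggers use context.

```python
# Hypothetical toy lexicon: one fixed tag per known word.
TAG_LEXICON = {
    "the": "Det",
    "lead": "N",
    "paint": "N",
    "is": "V",
    "unsafe": "Adj",
}

def tag(sentence, lexicon=TAG_LEXICON, default="N"):
    """Pair each whitespace-separated word with a tag from the lexicon."""
    return [(word, lexicon.get(word.lower(), default)) for word in sentence.split()]

def format_tags(tagged):
    """Render (word, tag) pairs in word/Tag notation."""
    return " ".join(f"{word}/{label}" for word, label in tagged)

print(format_tags(tag("the lead paint is unsafe")))
```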
 Tokenization is the process of breaking a stream
of text up into words, phrases, symbols, or other
meaningful elements called tokens.
 Converts streams of characters into words
 Tokens or words are separated by whitespace,
punctuation marks or line breaks.
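
A minimal regular-expression tokenizer illustrates the idea. The pattern below is one possible choice, not a standard: it keeps words (including a simple internal apostrophe) as tokens and emits each punctuation mark as its own token.

```python
import re

# A word with an optional apostrophe part, or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens, discarding whitespace."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Don't panic: text mining is fun!"))
```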
 Normalizes / unifies variations of the same data
 ‘walking’, ‘walks’, ‘walked’ → walk
 Inflectional stemming
 Remove plurals
 Normalize verb tenses
 Remove other affixes
 Stemming to root
 Reduce word to most basic element
 More aggressive than inflectional
 ‘Apply’, ‘applications’, ‘reapplied’ → apply
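
Inflectional stemming can be approximated with a crude suffix stripper. The suffix list and the minimum stem length below are assumptions chosen to handle the ‘walk’ example; this is not the Porter algorithm or any other published stemmer, and it will over-stem or under-stem many real words.

```python
# Suffixes are tried in this order; the first match is removed.
SUFFIXES = ("ing", "ed", "es", "s")

def inflectional_stem(word, min_stem_length=3):
    """Strip one common inflectional suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

print([inflectional_stem(w) for w in ["walking", "walks", "walked"]])
```

Stemming to the root, as in the ‘apply’ example, would additionally need prefix removal and rules for derivational suffixes, which this sketch leaves out.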
 The foremost problem in text mining is the ambiguity
of language, i.e. the capability of being understood in
two or more possible senses. Because one word or phrase
may have multiple meanings, its intended sense can be
unclear.
 In fields like bioinformatics, there are multiple names
for a single gene or protein, which may also lead to
ambiguity.
 Another problem arises when we use social media
data, i.e. status updates, tweets, comments, reviews,
etc. Most people use slang such as “btw” for “by the
way” and “ppl” for “people”. These words do not exist
in the dictionary, so they affect the mining results.
 A further problem is cleaning the data: if we extract
online texts, we also get the reference addresses of
the images linked with the text, and those references
are hard to remove.
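
Both cleaning problems can be sketched in a few lines. The `SLANG` table, the URL pattern, and the `clean_post` helper are illustrative assumptions; the pattern only catches plain `http://` or `https://` addresses, and slang with attached punctuation (for example "btw,") is not expanded.

```python
import re

# Hypothetical slang table; a real one would be far larger.
SLANG = {"btw": "by the way", "ppl": "people"}

# Matches http:// or https:// followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")

def clean_post(text, slang=SLANG):
    """Remove URLs, then expand known slang words."""
    text = URL_PATTERN.sub("", text)
    words = [slang.get(word.lower(), word) for word in text.split()]
    return " ".join(words)

print(clean_post("btw ppl love this https://example.com/img.png"))
```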
Text analysis is a fascinating technique for deriving
useful results from textual data. Using text mining
techniques, we can extract public reviews, classify text
into predefined classes, summarize documents, and group
or cluster multiple documents.
Textmining
