2. Defining Text Mining
Structured vs. Unstructured Data
Why Text Mining
Some Text Mining Ambiguities
Text Mining Practice Areas
Pre-processing Techniques
Challenges in Text Mining
Conclusion
3. • The use of computational methods and techniques to
extract high-quality information from text
• The discovery by computer of new, previously unknown
information, by automatically extracting information from a
usually large amount of different unstructured textual
resources
4. We have a collection of documents (mainly text or
html-based)
We have a set of users
A user wants to retrieve the documents related to
a given concept
The user then submits a query expressed
through words or terms
An information retrieval system returns the
documents most related to this concept
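The retrieval scenario above can be sketched in a few lines. This is a minimal illustration, assuming a simple term-overlap score; the function names `tokenize` and `rank` and the sample documents are hypothetical, and real retrieval systems use more refined weighting.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def rank(query, documents):
    """Return (score, index) pairs for documents sharing terms with the query."""
    query_terms = set(tokenize(query))
    scored = []
    for index, doc in enumerate(documents):
        counts = Counter(tokenize(doc))
        # Score = how often the query terms occur in the document.
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, index))
    # Highest score first; ties keep the original document order.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical three-document collection.
documents = [
    "The bank raised its interest rates yesterday",
    "Mary walked along the bank of the river",
    "News articles and business reports",
]
results = rank("river bank", documents)
print(results)
```

The second document matches both query terms and is ranked first; the third shares no term with the query and is not returned.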
6. Unstructured text is present in various forms, and
in huge and ever-increasing quantities:
1. books
2. financial and other business reports
3. various kinds of business and
administrative documents
4. news articles
It is estimated that ~80% of all available data is
unstructured
7. To enable effective and efficient use of such huge
quantities of textual content, we need
computational methods for
1. automated extraction of information from
unstructured text
2. analysis and summarization of extracted
information
TM research and practice are focused on the
development, continual improvement and
application of such methods
8. Language is ambiguous
Context is needed to clarify
The same word can have different meanings
Bear (verb) – to support or carry
Bear (noun) – a large animal
Different words can mean the same (synonyms)
Language is subtle (difficult to analyse)
Concept / word extraction usually results in a huge number of
dimensions
Thousands of new fields
Each field typically has low information content (sparse)
Misspellings, abbreviations, spelling variants
Render search engines, SQL queries, etc. ineffective.
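The point about dimensionality and sparsity can be made concrete with a small bag-of-words count. This is a sketch under simple assumptions: every distinct word becomes one field, and the three sample sentences stand in for a real corpus.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

documents = [
    "Mary walked along the bank of the river",
    "The bank raised its interest rates yesterday",
    "The store is next to the newly constructed bank",
]

# One field (dimension) per distinct word in the collection.
vocabulary = sorted({term for doc in documents for term in tokenize(doc)})

# Document-term matrix: one row per document, one column per word.
matrix = [[tokenize(doc).count(term) for term in vocabulary] for doc in documents]

nonzero = sum(1 for row in matrix for value in row if value)
sparsity = 1 - nonzero / (len(documents) * len(vocabulary))
print(len(vocabulary), nonzero, round(sparsity, 2))
```

Even with three short sentences, more than half of the cells are zero. With thousands of documents the number of fields grows into the thousands and almost every cell is empty.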
9. Homonymy: same word, different meaning
Mary walked along the bank of the river
HarborBank is the richest bank in the city
Synonymy: different words with similar or
the same meaning; one can often, but not always,
be substituted for the other without changing the meaning.
Miss Nelson became a kind of big sister to Benjamin
Miss Nelson became a kind of large sister to Benjamin
(odd: “big” and “large” are not interchangeable here)
10. Polysemy: same word or form, but different,
albeit related, meanings
The bank raised its interest rates yesterday
The store is next to the newly constructed bank
The bank appeared first in Italy in the Renaissance
Hyponymy: Concept hierarchy or subclass
Animal (noun) – cat, dog
Injury – broken leg, contusion
11. Search and Information Retrieval – storage and
retrieval of text documents, including search
engines and keyword search
Document Clustering – grouping and categorizing
terms, snippets, paragraphs or documents using
clustering methods
Document Classification – grouping and
categorizing snippets, paragraphs or documents
using data mining classification methods, based on
models trained on labelled examples
Web Mining – data and text mining on the
internet, with a specific focus on the scale and
interconnectedness of the web
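Document classification from labelled examples can be illustrated with a tiny naive Bayes classifier. This is only a sketch: the slides do not name a specific classification method, naive Bayes is chosen here as a common one, and the training sentences and the labels "finance" and "sport" are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocabulary

def classify(text, model):
    """Return the class with the highest log-probability for the text."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled examples.
training = [
    ("interest rates rise at the bank", "finance"),
    ("bank reports record profit", "finance"),
    ("the team won the match", "sport"),
    ("player scores in final match", "sport"),
]
model = train(training)
print(classify("bank raised interest rates", model))
```

Clustering differs in that no labels are given: the grouping has to be discovered from the documents themselves.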
12. Information Extraction – identification and
extraction of relevant facts and relationships from
unstructured text
Natural Language Processing – low-level language
processing and understanding tasks (e.g. part-of-speech
tagging)
Concept Extraction – grouping of words and
phrases into semantically similar groups
13. Document – a sequence of words and punctuation,
following the grammatical rules of the language.
Term – usually a word, but can be a word-pair or
phrase
Corpus – a collection of documents
Lexicon – the set of all unique words in the corpus
14. Text Normalization
Parts of Speech Tagging
Removal of stop words
Stop words – common words that don’t add
meaningful content to the document
Stemming
Removing suffixes and prefixes, leaving the root or stem of
the word.
Tokenization
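Stop word removal, listed above, amounts to filtering tokens against a list of common words. A minimal sketch, assuming a very small hypothetical stop list (`STOP_WORDS`); practical stop lists are much longer and depend on the language and the task.

```python
# Hypothetical, deliberately tiny stop list for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

filtered = remove_stop_words("the lead paint is unsafe".split())
print(filtered)
```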
16. Case
Make all lower case (if you don’t care about proper
nouns, titles, etc.)
Clean up transcription and typing errors
e.g. “do n’t”, “movei”
Correct misspelled words
Phonetically
Use fuzzy matching algorithms such as Soundex,
Metaphone or string edit distance
Dictionaries
Use POS and context to make a good guess
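String edit distance, one of the fuzzy matching options named above, can be sketched as follows. This uses the standard Levenshtein distance (insertions, deletions, substitutions); the word list `DICTIONARY` and the helper `correct` are hypothetical and only show how a misspelling could be mapped to its closest dictionary entry.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def correct(word, dictionary):
    """Return the dictionary entry closest to the given word."""
    return min(dictionary, key=lambda candidate: edit_distance(word, candidate))

# Hypothetical dictionary of known words.
DICTIONARY = ["movie", "mining", "text", "document"]
suggestion = correct("movei", DICTIONARY)
print(suggestion, edit_distance("movei", "movie"))
```

A closest-match rule like this can still pick the wrong word when several candidates are equally near, which is why the slide also suggests using POS and context.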
17. POS tagging is a process of assigning a POS or
lexical class marker to each word in a sentence
(and all sentences in a corpus).
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V
unsafe/Adj
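The input/output format above can be reproduced with a toy lookup tagger. This is a sketch only: the `LEXICON` mapping is hypothetical and covers just the example sentence. A pure lookup cannot decide whether "lead" is a noun or a verb, which is why practical taggers also use the surrounding context.

```python
# Hypothetical toy lexicon: one tag per word, covering only the example.
LEXICON = {"the": "Det", "lead": "N", "paint": "N", "is": "V", "unsafe": "Adj"}

def tag(sentence, default="N"):
    """Assign each word its lexicon tag, falling back to a default tag."""
    return [(word, LEXICON.get(word.lower(), default)) for word in sentence.split()]

def format_tags(pairs):
    """Render (word, tag) pairs in word/TAG notation."""
    return " ".join(f"{word}/{label}" for word, label in pairs)

tagged = format_tags(tag("the lead paint is unsafe"))
print(tagged)
```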
18. Tokenization is the process of breaking a stream
of text up into words, phrases, symbols, or other
meaningful elements called tokens.
Converts streams of characters into words
Tokens or words are separated by whitespace,
punctuation marks or line breaks.
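A minimal tokenizer along these lines can be written with a regular expression. The pattern below is one possible choice, not a standard: it keeps words (including an ASCII apostrophe inside a word) as tokens and emits each punctuation mark as its own token.

```python
import re

# A word, optionally with internal apostrophes, or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Language is ambiguous; context is needed.")
print(tokens)
```

Whether punctuation is kept, dropped or merged with neighbouring words is a design decision that depends on the later processing steps.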
19. Normalizes / unifies variations of the same data
‘walking’, ‘walks’, ‘walked’ → walk
Inflectional stemming
Remove plurals
Normalize verb tenses
Remove other affixes
Stemming to root
Reduce word to most basic element
More aggressive than inflectional
‘apply’, ‘applications’, ‘reapplied’ → apply
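Inflectional stemming can be sketched as naive suffix stripping. This is a deliberately crude illustration, assuming only three suffixes (`SUFFIXES`) and a minimum stem length of three letters; it is not the Porter algorithm or any other published stemmer.

```python
# Suffixes tried in order; a hypothetical minimal rule set.
SUFFIXES = ("ing", "ed", "s")

def inflectional_stem(word):
    """Strip one common inflectional suffix if a stem of 3+ letters remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [inflectional_stem(w) for w in ["walking", "walks", "walked"]]
print(stems)
print(inflectional_stem("applications"))
```

The last call returns "application", not "apply": reaching the root requires the more aggressive stemming described above, with additional rules for derivational affixes.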
20. The foremost problem in text mining is the ambiguity
of language, i.e. the capability of being understood in
two or more possible senses. Because one word or phrase
may have multiple meanings, the intended sense is often
unclear.
In fields like bioinformatics, a single gene or protein
often has multiple names, which can also lead to
ambiguity problems.
21. One more problem with text mining arises when we
use social media data, i.e. status updates,
tweets, comments, reviews, etc. Most people use
slang such as “btw” for by the way or “ppl” for
people. These words do not exist in the
dictionary, so they affect the mining
results.
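One common way to reduce the effect of slang is to expand known abbreviations before further processing. A minimal sketch, assuming a hand-built mapping (`SLANG`, hypothetical and limited to the two examples above):

```python
# Hypothetical slang dictionary; a real one would be far larger.
SLANG = {"btw": "by the way", "ppl": "people"}

def expand_slang(text):
    """Replace known slang tokens with their expansions, word by word."""
    return " ".join(SLANG.get(word.lower(), word) for word in text.split())

expanded = expand_slang("btw most ppl like this movie")
print(expanded)
```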
Another problem with text mining is cleaning the
data. If we extract online texts, we also get the
reference addresses of the images linked with the
text, and those references are hard to remove.
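For the simplest cases, image references can be stripped with regular expressions. The sketch below assumes the references appear either as HTML `<img>` tags or as plain `http(s)` addresses; the patterns and the sample string are illustrative, and real pages usually need a proper HTML parser.

```python
import re

IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)  # whole <img ...> tags
URL = re.compile(r"https?://\S+")                      # bare web addresses

def clean(text):
    """Remove image tags and URLs, then collapse repeated whitespace."""
    text = IMG_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    return " ".join(text.split())

# Hypothetical scraped snippet.
raw = 'Great phone <img src="http://example.com/a.png"> see http://example.com/pic.jpg now'
cleaned = clean(raw)
print(cleaned)
```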
22. Text analysis is a powerful technique for
deriving useful results from textual data. By
using text mining techniques we can extract public
reviews, classify text into predefined classes,
summarize documents and group or
cluster multiple documents.