2. Text Mining is about…
“Sifting through vast collections of unstructured or
semistructured data beyond the reach of data mining
tools, text mining tracks information sources, links isolated
concepts in distant documents, maps relationships
between activities, and helps answer questions.”
Tapping the Power of Text Mining
Communications of the ACM, Sept. 2006
3. Humans vs. Computers
• Humans: Ability to distinguish and apply linguistic patterns to text
– Can overcome language difficulties such as slang, spelling
variations, and contextual meaning
• Computers: Ability to process text in large volumes at high speed
– Can sift through a large collection of texts to compute simple statistics
and find relationships among terms almost instantly
• Text mining requires a combination of both
– Humans' linguistic capability (NLP) + computers' speed and accuracy (data mining)
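As a rough illustration of the "simple statistics and relationships among terms" that a computer can produce quickly, the sketch below counts term frequencies and document-level co-occurrences. The function names, the regex tokenizer, and the two sample documents are assumptions made for this example, not part of the original slides.

```python
from collections import Counter
from itertools import combinations
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_statistics(documents):
    """Return (term frequencies, co-occurrence counts of term pairs per document)."""
    term_freq = Counter()
    cooccurrence = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        term_freq.update(tokens)
        # Count each unordered pair of distinct terms once per document.
        # Pairs are stored in alphabetical order, e.g. ("mining", "patterns").
        for pair in combinations(sorted(set(tokens)), 2):
            cooccurrence[pair] += 1
    return term_freq, cooccurrence


# Hypothetical sample documents.
docs = [
    "text mining finds patterns in text",
    "data mining finds patterns in data",
]
term_freq, cooccurrence = term_statistics(docs)
print(term_freq.most_common(3))
print(cooccurrence[("mining", "patterns")])
```

Counting is the easy part; deciding which of these pairs are meaningful is where the linguistic side comes in.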
4. Text Mining Tasks
• Information extraction:
– Analyze unstructured text and identify key words or
phrases and relationships within text
• Topic detection and tracking:
– Filter and present only documents relevant to the user
profile
• Summarization:
– Text summarization reduces the content by retaining
only its main points and overall meaning
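One simple way to approximate summarization is extractive: score each sentence by the average frequency of its words and keep the top-scoring sentences. The sketch below assumes this frequency heuristic and a naive regex sentence splitter; it is an illustration only, and the `summarize` name and sample text are hypothetical.

```python
from collections import Counter
import re


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        # Average document-wide frequency of the words in this sentence.
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]


# Hypothetical sample text.
text = ("Text mining extracts information from text. "
        "The weather was nice. "
        "Text mining tools process text quickly.")
summary = summarize(text, max_sentences=2)
print(summary)
```

A real summarizer would also remove stop words and handle abbreviations; both are left out here to keep the sketch short.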
5. Text Mining Tasks
• Categorization:
– Automatically classify documents into predefined
categories
• Clustering:
– Group documents based on their similarity to one another
• Concept linkage:
– Connect related documents by identifying their shared
concepts, helping users find information they perhaps
wouldn't have found through traditional search methods
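To make "similarity" concrete, the following sketch represents each document as a bag of words, compares documents with cosine similarity, and groups them with a greedy single-pass procedure. The clustering rule, the 0.3 threshold, and the sample documents are assumptions chosen for illustration; practical systems typically use weighted vectors and more robust clustering algorithms.

```python
from collections import Counter
import math
import re


def bag_of_words(text):
    """Map each lowercase word in the text to its count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(documents, threshold=0.3):
    """Greedy single-pass clustering against each cluster's first (seed) document."""
    vectors = [bag_of_words(d) for d in documents]
    clusters = []  # each cluster is a list of document indices; index 0 is the seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_similarity(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([i])
    return clusters


# Hypothetical sample documents.
docs = [
    "stock prices rise as markets rally",
    "heavy rain and storms expected this weekend",
    "markets fall and stock prices drop",
    "weekend weather brings rain and storms",
]
clusters = cluster(docs)
print(clusters)
```

The same similarity measure could support categorization by comparing a new document against labeled examples, or concept linkage by surfacing the terms two documents share.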
6. Text Mining Tasks
• Information visualization:
– Represent documents or information in graphical
formats for easy browsing, viewing, or searching
• Question answering (Q&A):
– Search and extract the best answer to a given question
7. Applications: Tech Mining
• Tech Mining is the application of text mining
tools to science and technology (S&T)
information, particularly bibliographic abstracts
• It exploits the S&T databases to see patterns,
detect associations, and foresee opportunities
16. Applications: ABDUL
(Artificial BudDy U Love)
• An online information service that currently provides
access to Thai linguistic resources (e.g., dictionary and sentence
translation) and information resources (e.g., weather
conditions, stock prices, gas prices, and traffic conditions)
• Users can interact with ABDUL in natural language
via instant messaging (IM) protocols, Web
browsers, and mobile devices
20. User-Generated Content
• With Web 2.0 and social networking websites, the
amount of user-generated content has increased
exponentially
• User-generated content often contains opinions and/or
sentiments
• An in-depth analysis of these opinionated texts could
reveal potentially useful information, e.g.,
– Preferences of people towards many different topics including news
events, social issues and commercial products
22. Characteristics of Online Reviews
• Natural language and unstructured text format
• Some reviews are long and contain only a few
sentences expressing opinions on the product
• It can be difficult for a potential reader to
understand and analyze every review that
may be relevant to his or her decision making
23. Opinion Mining
• Opinion mining, also called sentiment analysis, is the task of
analyzing and summarizing what people think about a
certain topic
• Opinion mining has gained a lot of interest in the text mining
and NLP communities
• Three granularities of opinion mining:
– Document level
– Sentence level
– Feature level
24. Feature-Based Opinion Mining
• This approach typically consists of the following two
steps:
1. Identifying and extracting features of an object,
topic or event from each sentence
2. Determining whether the opinions regarding the
features are positive or negative
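The two steps above can be sketched with a toy lexicon-based approach. The feature list, the sentiment word lists, and the `mine_opinions` function are hypothetical and exist only for this example; the sketch also returns "neutral" when a sentence has no net opinion words, which goes beyond the positive/negative split on the slide.

```python
import re

# Hypothetical product features and a toy sentiment lexicon (illustration only).
FEATURES = {"battery", "screen", "camera"}
POSITIVE = {"great", "good", "excellent", "sharp"}
NEGATIVE = {"poor", "bad", "terrible", "short"}


def split_sentences(review):
    """Naively split a review into sentences on '.', '!' and '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", review) if s.strip()]


def mine_opinions(review):
    """Return (feature, polarity) pairs, one per feature mention in a sentence."""
    results = []
    for sentence in split_sentences(review):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Step 1: identify the features mentioned in this sentence.
        mentioned = [t for t in tokens if t in FEATURES]
        # Step 2: score the sentence's opinion words and assign a polarity.
        score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
        if score > 0:
            polarity = "positive"
        elif score < 0:
            polarity = "negative"
        else:
            polarity = "neutral"
        results.extend((feature, polarity) for feature in mentioned)
    return results


# Hypothetical sample review.
review = "The screen is sharp and the camera is great. The battery life is poor."
opinions = mine_opinions(review)
print(opinions)
```

Because polarity is assigned per sentence, a sentence that praises one feature and criticizes another would be mislabeled; handling that case requires linking each opinion word to the feature it modifies.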
30. Challenges in Text Mining
• Text Mining = NLP + Data Mining
• Statistical NLP
– Ambiguity
– Context
– Tokenization and sentence detection
– POS tagging
• Data Mining
– Ability to process the data
– Massive amounts of data
– Determining and extracting information of interest
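A short sketch shows why even tokenization and sentence detection are listed as challenges. The rule-based splitter below, an assumption made for this example, breaks a sentence after any '.', '!' or '?' that is followed by whitespace and an uppercase letter, so it wrongly splits after the abbreviation "Dr." in the sample text.

```python
import re


def detect_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace and an uppercase letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


def tokenize(sentence):
    """Return word tokens (keeping internal apostrophes and hyphens) and punctuation marks."""
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)


# Hypothetical sample text containing an abbreviation.
text = "Text mining is hard. Dr. Smith said so! Isn't it?"
sentences = detect_sentences(text)
for sentence in sentences:
    print(sentence, tokenize(sentence))
```

The splitter yields four pieces instead of three because "Dr." looks like a sentence ending. Resolving such cases needs context, for example an abbreviation list or a trained model.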
31. Conclusions
• As the amount of data increases, text-mining
tools that sift through it will be increasingly
valuable
• Various applications exist for both academic and
industrial use