This document discusses the benefits of text mining for organizations. It describes how text mining can analyze large amounts of text data through techniques like document classification, information retrieval, word frequency analysis, sentiment analysis, and topic modeling to provide meaningful insights. These insights can help with tasks like root cause analysis, competitive strategy development, and enhancing customer experience. The document provides an overview of the text mining process and examples of how organizations in different industries can utilize text mining.
Executive Summary
Text mining has evolved and garnered a lot of interest in recent times, and not without reason. Organizations collect huge amounts of text data, and they can use various text mining tools and techniques to analyse that data for meaningful, actionable insights. This paper discusses how automatic document classification, information retrieval, word frequency calculation, sentiment analysis, topic modelling and trend analysis can be utilized for root cause analysis, devising competitive strategies, enhancing customer experience and so on.
The Approach
Text data can be in different forms: Facebook posts/comments, tweets, customer feedback, blog posts, sales rep. notes, patient health records, complaint logs, third-party surveys (free text formats), newswires, newsfeeds, documents, etc. Conceptually, text mining is a three-step process (Figure 1).
Comments/feedback can be stored in simple tables in a spreadsheet/database and documents can be stored in folders. Text available on social media and other websites can be collected using scraping tools. Known attributes of the text, like date, location, customer ID, business area, should also be captured and recorded, as that would help in filtering and selective analysis. Incoming text can be cleaned up by removing irrelevant data, if required. This text data can then be mined with the help of commercial programs (SAS, SPSS, SAP InfiniteInsight), niche commercial or open source programs (Attensity, OpenText, SmartLogic, openNLP, KNIME) or analytics tools (R, Python) (Figure 2).
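As a rough illustration of the loading step, the sketch below keeps each comment alongside its known attributes, cleans the text, and filters on those attributes. The sample records, the field names, and the `clean` and `select` helpers are assumptions made for this example; they are not taken from any of the tools named above.

```python
import re

# Hypothetical feedback records; field names and values are illustrative only.
records = [
    {"text": "  Baggage took <b>forever</b> to arrive!! ",
     "date": "2015-03-02", "location": "Delhi", "business_area": "baggage"},
    {"text": "Check-in staff were very helpful.",
     "date": "2015-03-05", "location": "Mumbai", "business_area": "check-in"},
]

def clean(text):
    """Remove HTML-like tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def select(rows, **attrs):
    """Return cleaned copies of the rows whose attributes equal all given values."""
    return [dict(row, text=clean(row["text"]))
            for row in rows
            if all(row.get(key) == value for key, value in attrs.items())]

delhi = select(records, location="Delhi")
print(delhi[0]["text"])  # Baggage took forever to arrive!!
```

Recording attributes such as location or business area next to the text is what makes this kind of selective analysis possible later on.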
Figure 1: The three-step text mining process.
EXTRACTION: web sensors, open and paid APIs, web scraping, internal data, customer feedback, value chain partner interaction.
LOADING: data into flat files or spreadsheets, data load in databases.
ANALYTICS: advanced text analytics, word frequency, word associations, sentiment analysis, trend analysis.
Figure 2
Inputs: surveys, emails, logs, web scraping tools.
Tools: SAS Text Analytics, SPSS Text Analytics, SAP InfiniteInsight, Smart Logic, Rapid Miner, Open Text, R text mining.
Outputs: word cloud, sentiment analysis and trends, topic analysis.
BI Strategy & Framework BI Strategy & Framework
Techniques
Document Classification and Grouping
The immediate benefit organisations get from text mining is better classification of documents (mails, feedback, comments). It helps them classify documents based on functions, departments, business areas, etc., improving efficiency. Queries or news wires can be classified and routed to the concerned people, so they can take appropriate action. Techniques like K Nearest Neighbor (KNN), Decision Trees, Bayesian Classifier, etc. can all be used to classify documents.
Usually, documents are matched against already known categories, but there could be cases where several incoming documents need to be grouped, without classifying them based on their structures. This is where the concept of high homogeneity (within a group) and high heterogeneity (between groups) comes into the picture. In this regard, document classification and grouping are very similar to supervised and unsupervised learning respectively.
Information Retrieval
Businesses need information to base their decisions on, and hence must have such insightful information at their disposal. Information retrieval refers to the technique of fetching a set of documents containing desired information. This can be achieved by passing a ‘clue’ to the program, which can be in the form of a keyword or combination of words. The program returns documents containing text with the closest match to the ‘clue’. From a text mining perspective, single or multiple words passed as a ‘clue’ can be considered as a document. Based on ‘similarity’ measures, a set of documents with the best match is returned. The simplest and most obvious similarity measure is the count of common words. More sophisticated measures extend the idea of word count, like count with weightage, cosine similarity, etc.
Word Frequency
One of the most widely used techniques for text mining based on word frequency is the wordcloud: a visually appealing representation of the most frequent words in the text data, which makes the font size proportional to the frequency of the word. Most text mining tools provide multiple options to control the number of words, remove redundant words, reduce the frequency of similar words, etc. They offer a simple albeit insightful way to highlight hitherto unknown aspects of the business. Wordclouds can be made more insightful by building drill-down capabilities that can search for comments with a specified ‘keyword’. Organisations can either opt for specialized software to generate wordclouds, or choose general purpose text mining programs that have such functionalities built in.
Sentiment Analysis
The most exciting and challenging aspect of text mining is analyzing the sentiments in the text data. Imagine a scenario where businesses can understand the sentiments (positive, negative or neutral) latent in comments/posts/feedback without going through copious amounts of text data. Such information would be extremely valuable to organisations operating in customer centric B2C markets, like hospitality, travel, banking, retail, etc. Sentiments can be analyzed in two ways:
- Machine Learning: Here, a training data set is made available to the program with predefined sentiments. The algorithm trains based on occurrence of words or patterns, and new incoming texts are classified based on the occurrence of such patterns or words.
- Sentiment Lexicons: Here, matches of positive or negative words are found and a sentiment measure is calculated.
However, it is easier said than done. The complexity of languages, sarcasm, colloquialism, poor spelling and grammar can mar the accuracy of results in this area. In addition, sentiments need to be analyzed holistically and in context. For example, finding two products or services similar can only be construed as positive or negative by knowing about the two products or services being discussed. The comment could be a compliment or criticism, depending upon what it is being compared to. Sentiment analysis engines are generally 60-70% accurate (humans are 80% accurate), but results of sentiment analysis are improving with continuous research.
Topic & Content Analysis
Advanced text mining can be used to analyze topics and context. Just like sentiment analysis, topic analysis can be done either through machine learning or based on predefined dictionaries. Advanced techniques, like Support Vector Machines, Latent Dirichlet Allocation, etc., can also be used. In addition, word clustering (Hierarchical or K Means Clustering techniques), network diagrams and word associations can be used to look for topics or context based on natural word groups. Co-occurrence and proximity are two of the most useful ways to group words in similar topics.
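To make the document classification technique more concrete, here is a minimal sketch of a K Nearest Neighbor classifier that uses the count of common words as its similarity measure. The training comments, the category labels, and the `knn_classify` helper are hypothetical and chosen only for illustration; a production system would normally rely on an established library and far more training data.

```python
import re
from collections import Counter

def tokens(text):
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def knn_classify(text, labelled_docs, k=3):
    """Label text by majority vote among its k most similar labelled documents.

    Similarity here is simply the number of words two documents share.
    """
    query = tokens(text)
    ranked = sorted(labelled_docs,
                    key=lambda item: len(query & tokens(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled comments used as training data.
training = [
    ("My suitcase was lost on the flight", "baggage"),
    ("Baggage arrived damaged and late", "baggage"),
    ("The suitcase handle was broken on arrival", "baggage"),
    ("Invoice shows the wrong amount", "billing"),
    ("I was charged twice on my card", "billing"),
    ("Refund for the invoice has not arrived", "billing"),
]

print(knn_classify("My baggage and suitcase arrived late", training))
```

Once a comment has a label, it can be routed to the team responsible for that category.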
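The information retrieval idea, where the ‘clue’ is treated as a small document and compared with every stored document, can be sketched with cosine similarity over raw word counts. The sample documents and the `retrieve` helper are assumptions for this example; weighting schemes such as count with weightage are left out to keep it short.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lower-case words."""
    return re.findall(r"[a-z']+", text.lower())

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (Counter objects)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(clue, documents, top_n=2):
    """Return up to top_n documents that best match the clue, best first."""
    clue_vec = Counter(tokenize(clue))
    scored = [(cosine_similarity(clue_vec, Counter(tokenize(doc))), doc)
              for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_n] if score > 0]

# Hypothetical customer comments.
docs = [
    "The baggage handling at the airport was slow",
    "In-flight entertainment system did not work",
    "Friendly staff and quick check-in at the hotel",
]

print(retrieve("baggage delay", docs))
```

Documents that share no words with the clue score zero and are dropped, so the result can be shorter than `top_n`.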
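The sentiment lexicon approach can be illustrated with a very small sketch: count matches against lists of positive and negative words and turn them into a score. The word lists and the scoring formula below are assumptions for illustration only; real lexicons are far larger and usually carry per-word weights.

```python
import re

# Tiny illustrative lexicons; these word lists are assumptions, not a real lexicon.
POSITIVE = {"good", "great", "friendly", "clean", "quick", "love"}
NEGATIVE = {"bad", "slow", "rude", "dirty", "broken", "hate"}

def sentiment_score(text):
    """Return (positive - negative) / (positive + negative) matches, or 0.0 if none."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0

def label(text):
    """Map the score to positive, negative or neutral."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label("The staff were friendly and the room was clean"))  # positive
print(label("Oh great, my flight is late again"))               # positive
```

The second comment is sarcastic, yet it is labelled positive because only the word "great" matches. This is the kind of error that limits the accuracy of purely lexicon-based scoring.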
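Co-occurrence, one of the ways of grouping words into topics, can be approximated by counting how often two words appear in the same comment. The stop-word list, the sample comments, and the `cooccurrence` helper are assumptions for this sketch; proximity-based measures and clustering are not shown.

```python
import re
from collections import Counter
from itertools import combinations

# A very small, illustrative stop-word list.
STOPWORDS = {"the", "a", "and", "was", "is", "of", "to", "in", "at"}

def cooccurrence(documents):
    """Count how often each unordered word pair appears in the same document."""
    pairs = Counter()
    for doc in documents:
        words = sorted(set(re.findall(r"[a-z']+", doc.lower())) - STOPWORDS)
        pairs.update(combinations(words, 2))
    return pairs

# Hypothetical customer comments.
comments = [
    "The baggage was delayed at the airport",
    "Delayed baggage and rude staff",
    "The room was clean and the staff friendly",
]

print(cooccurrence(comments).most_common(1))  # [(('baggage', 'delayed'), 2)]
```

Pairs with high counts, such as "baggage" and "delayed" here, hint at a natural word group that may correspond to a topic.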
Business Use Cases
Text mining finds application across industries and functions: travel, retail, banking, hospitality, healthcare, etc. Market intelligence teams can segregate news feeds and web articles based on document classification or grouping. Similarly, incoming emails can be separated and auto-forwarded to respective teams. An airline company may find that its in-flight entertainment system or baggage handling process is a sore point for its customers, or a hotel may find that a seemingly innocuous construction near its premises could irritate its otherwise satisfied customers. HR teams can analyze employee feedback and find potential areas of improvement for organizational development. Reviews can be made even more valuable by crawling the top web searches for user queries and providing text mining results (frequent words, sentiments, broad areas of discussions, trends etc.) in real time.
Text mining can be very valuable for both intra- as well as inter-organizational benchmarking. One can view word clouds, sentiments, etc. in two different time frames, for two different geographies, departments, functions, etc. Text mining can be used for comparison against industry rivals too. If time-stamped data is available, it can be used for a trend analysis of sentiments or social outreach (no. of comments, posts, likes etc.). For example, an F&B organization can see how its flagship product pitches against its rival’s product. A simple wordcloud and drilldown can reveal that it is not considered as healthy a breakfast companion as the other beverage sold by its competitor. It can also be used to find associations between diseases, and interactions between and adverse effects of drugs.
Text mining can also be used in tandem with voice-to-text technology for analyzing transcripts of the cockpit voice recorders of airlines to gain insights. This can help understand the reasons behind anomalous and risky flight manoeuvres or flight incidents. Voice-based feedback in hotels or banks can also serve as important inputs for text mining.
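For the trend analysis of time-stamped data mentioned in this section, a minimal sketch is to group comments by month and track both their number and their average sentiment. The dates and scores below are made-up illustrative values, and `monthly_trend` is a hypothetical helper; the scores are assumed to come from whichever sentiment engine is in use.

```python
from collections import defaultdict
from datetime import date

def monthly_trend(scored_comments):
    """Group (date, score) pairs by month.

    Returns {(year, month): (number of comments, mean score)} in date order.
    """
    buckets = defaultdict(list)
    for day, score in scored_comments:
        buckets[(day.year, day.month)].append(score)
    return {month: (len(scores), sum(scores) / len(scores))
            for month, scores in sorted(buckets.items())}

# Made-up sample data: one (date, sentiment score) pair per comment.
comments = [
    (date(2015, 1, 5), 1.0), (date(2015, 1, 20), -1.0),
    (date(2015, 2, 3), 1.0), (date(2015, 2, 14), 1.0),
]

for month, (count, mean) in monthly_trend(comments).items():
    print(month, count, mean)
```

Running the same grouping by geography or department instead of by month would support the benchmarking comparisons described in this section.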
Author
Anand Nath Jha,
Analytics Architect – DWBI & Analytics,
ITC Infotech
Mr. Anand has 17 years of diverse experience in strategic Marketing, Analytics, Project Management and Aerospace Engineering. He has worked for Honeywell, General Electric, Hindustan Aeronautics Ltd. and LM Windpower. He graduated in Aerospace Engineering from IIT Kanpur, and pursued an MBA and an Advanced Certification in Analytics from University of Phoenix and IIM Lucknow respectively.
Co-Author
Viros Sharma,
Vice President & Global Practice Head – DWBI & Analytics, ITC Infotech
Mr. Viros has more than 20 years of experience in the DW/BI Consulting and Practice-Building space. He has worked for multinational IT companies like BearingPoint and iGATE in India and USA. He did his AMP from IIMB and holds double masters in Mathematics and Computer Applications.
About ITC Infotech
ITC Infotech, a fully owned subsidiary of USD 7 billion ITC Ltd, provides IT services and solutions to leading global
customers. The company has carved a niche for itself by addressing customer challenges through innovative IT
solutions.
ITC Infotech is focused on servicing the BFSI (Banking, Financial Services & Insurance), CPG&R (Consumer Packaged
Goods & Retail), Life Sciences, Manufacturing & Engineering Services, THT (Travel, Hospitality and Transportation) and
Media & Entertainment industries.
For more information, please visit http://www.itcinfotech.com | Or write to: contact.us@itcinfotech.com