Big Data Text Analytics Techniques for Information Extraction
1. International Journal of Research in Advent Technology, Vol.2, No.6, June 2014
E-ISSN: 2321-9637
166
Big Data: Text Analytics
Mrs. Balshetwar S.V.1, Prof. (Dr.) Tugnayat R.M.2
1HOD, Department of Information Technology, Satara College of Engineering and Management, Limb, Satara
Shivaji University, Kolhapur (Maharashtra) Email: balshetwar.satara@gmail.com
2Principal, Shri Shankarprasad Agnihotri College of Engineering, Wardha
Sant Gadge Baba Amravati University, Amravati (Maharashtra)
Abstract-Every part of today's technological world is flooded with big data. Almost 80% of this big data is
unstructured, because it comes from new sources such as device logs, server logs, Twitter feeds, chat
data, blogs, web pages, emails and social media content. Much of it is text that people create
to express themselves to others, which makes it an important source of potentially
valuable information. Text analytics techniques can extract information from these sources
and apply it to customer management, sentiment analysis, and collaborative analysis. This paper
discusses some basic techniques that identify useful patterns in the text within big data.
Index Terms - Big data, Text analytics, LDA.
1. INTRODUCTION
A text is a piece of information through which
humans communicate with each other. A broad range of
applications and devices is available for text
communication and for intentionally sharing data, so
text is being collected at an unprecedented scale.
Decisions that were previously based on guesswork or
on models of reality are now made on the basis of this
data. Big data analysis now drives every aspect of
modern applications, gadgets, industry and society.
Text within big data, such as data from
newspapers, magazines, web pages, emails, blogs and
tweets, is particularly important because these are
sources that hold valuable information for humans.
Utilizing this large amount of text data requires
techniques for processing it.
Those techniques must have the following
characteristics:
(1) They must be fast and accurate in processing.
(2) They must be able to find relationships with other
information.
(3) They must remove all ambiguity from the data.
(4) They must handle heterogeneous data efficiently.
Steps in big data analysis
The flow chart in Fig. 1 shows the steps in big data
analysis for retrieving or extracting important and
valuable information.
The analysis steps shown in Fig. 1 for big data
querying and mining are very different from
traditional analysis methods designed for small
amounts of data. Big data is often noisy, dynamic,
interrelated and untrustworthy. Nevertheless, even
noisy big data can be more valuable than small
samples, because statistics obtained from frequent
patterns and correlation analysis usually reveal
hidden patterns and knowledge more reliably.
The analysis step can extract meaningful, related
information by processing natural-language text, but
such text is not easy to analyze with simple
regression models or decision trees. The group of
techniques known as text analytics helps by
translating this complex textual information into
useful signals that support deeper analysis.
Fig 1. Steps in big data analysis: various sources of data -> recording data -> cleaning data -> integrating -> analyzing -> interpreting
2. LITERATURE REVIEW
Based on a survey of over 4,000 information
technology (IT) professionals from 93 countries and
25 industries, the IBM Tech Trends Report (2011)
identified business analytics as one of the four major
technology trends in the 2010s. In a survey of the
state of business analytics by Bloomberg
Businessweek (2011), 97 percent of companies with
revenues exceeding $100 million were found to use
some form of business analytics. A report by the
McKinsey Global Institute (Manyika et al. 2011)
predicted that by 2018, the United States alone will
face a shortage of 140,000 to 190,000 people with
deep analytical skills, as well as a shortfall of 1.5
million data-savvy managers with the know-how to
analyze big data to make effective decisions [1].
Emerging analytics research opportunities can be
classified into five critical technical areas: (big) data
analytics, text analytics, web analytics, network
analytics, and mobile analytics. All of these can
contribute to business intelligence and analytics
(BI&A) [1].
In its “Hype Cycle for Big Data” report (July 2013),
Gartner positions text analytics as delivering great
business benefit, with adoption projected in the next
two to five years [2].
Text analytics has its academic roots in
information retrieval and computational linguistics. In
information retrieval, document representation and
query processing are the foundations for developing
the vector-space model, Boolean retrieval model, and
probabilistic retrieval model, which in turn, became
the basis for the modern digital libraries, search
engines, and enterprise search systems[3]. In
computational linguistics, statistical natural language
processing (NLP) techniques for lexical acquisition,
word sense disambiguation, part-of-speech-tagging
(POST), and probabilistic context-free grammars have
also become important for representing text [4].
Academic roots of text analytics:
Information retrieval: document representation and query processing. Applications: digital libraries, search engines.
Computational linguistics: NLP, CFG, POST. Application: representing text.
3. ALGORITHMS FOR TEXT ANALYTICS
Text analytics refers to the process of deriving
high quality information from text. Information is
derived by finding/learning patterns and trends in it
using statistical methods.
Many algorithms exist for extracting a specific type
of information from text data, and which one to apply
depends on the type of data analysis project at hand.
Some projects have clear objectives. Others are
exploratory: they inspect the data in depth to pull
valuable information out of a mass of text where the
outcome is not known in advance, and the results can
then be used for further analysis.
Depending on the project at hand, statistical methods
based on a frequency matrix (which counts the words
appearing in the various text sources) or a
term-document matrix (which lists all the unique terms
in the text under examination) often yield a useful
new feature once a suitable statistical technique is
applied. These methods give intermediate results that
serve as a foundation for further analysis. After the
documents are examined, a scorecard is prepared that
scores every term by the number of times it appears in
the document. A threshold is then applied, and only
the terms that score above it are collected and used
to construct a larger concept.
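The scorecard-and-threshold procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the sample documents and the threshold value are all assumptions made for the example, and the score used here is simply the total count of a term across all documents.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] counts terms[j] in document i."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


def terms_above_threshold(terms, matrix, threshold):
    """Keep terms whose total count across all documents exceeds the threshold."""
    totals = [sum(column) for column in zip(*matrix)]
    return {t: n for t, n in zip(terms, totals) if n > threshold}


# Hypothetical sample documents, invented for illustration.
documents = [
    "The delivery was late and the support was slow",
    "Support resolved the late delivery quickly",
    "Great product and quick delivery",
]
terms, matrix = term_document_matrix(documents)
scorecard = terms_above_threshold(terms, matrix, threshold=1)
print(scorecard)
```

Common words such as "the" survive the threshold in this sketch. In practice a stop-word list or a weighting scheme is usually applied before the surviving terms are used to build larger concepts.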
Other text analytics techniques make use of Named
Entity Extraction (NEE), a method that identifies the
atomic elements in text and classifies them into
predefined entity types such as person, place,
product and date.
Using NEE, a probability can be assigned that a
particular document refers to a given named entity.
NEE is based on natural language processing
(NLP). After analyzing the structure of the text, NEE
generates a score for every entity identified in it.
By applying a threshold to these scores, the retained
entities can be used to create structured features
for use in prediction models.
NEE has been successfully developed for news
analysis and biomedical application [1].
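The score-then-threshold idea behind NEE can be illustrated with a toy sketch. Note the substitution: a real NEE system relies on NLP models, whereas this example uses a fixed lookup table (a gazetteer) so that it stays self-contained. The gazetteer entries, the scoring rule (an entity's share of all entity mentions) and the threshold are assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical gazetteer: surface form -> entity type. A real NEE system
# would use NLP models rather than a fixed lookup table.
GAZETTEER = {
    "raj": "PERSON",
    "jyoti": "PERSON",
    "satara": "PLACE",
    "kolhapur": "PLACE",
}


def extract_entities(text):
    """Count (name, type) mentions of gazetteer entries found in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter((t, GAZETTEER[t]) for t in tokens if t in GAZETTEER)


def score_entities(text):
    """Score each entity as its share of all entity mentions in the text."""
    mentions = extract_entities(text)
    total = sum(mentions.values())
    if total == 0:
        return {}
    return {entity: n / total for entity, n in mentions.items()}


def entity_features(text, threshold=0.3):
    """Keep entities scoring at least the threshold as structured features."""
    return {
        f"{etype}:{name}": round(score, 2)
        for (name, etype), score in score_entities(text).items()
        if score >= threshold
    }


# Hypothetical note, invented for illustration.
note = "Raj called from Satara. Raj said the order sent to Satara was late; Jyoti agreed."
features = entity_features(note)
print(features)
```

The resulting dictionary is the kind of structured feature set that could be attached to a record and passed to a prediction model.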
Another text analytics technique, widely used in
emerging areas such as topic modeling, is LDA
(Latent Dirichlet Allocation). LDA is mainly used to
find the main topics or themes that run through a
large unstructured collection of documents, and it is
also useful for detecting changes in customer
behavior.
LDA is an unsupervised method that is applied to
unstructured data. Unlike NEE, it is not NLP based;
rather, it looks for patterns in the text.
LDA can be applied to structured, unstructured or
semi-structured data from any number of sources to
identify patterns in text.
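To make the unsupervised nature of LDA concrete, the following is a small sketch of one common way to fit it, collapsed Gibbs sampling, written in plain Python. It is an illustration under stated assumptions rather than a production implementation: the function name `lda_gibbs`, the hyperparameter values (`alpha`, `beta`), the iteration count and the tiny sample corpus are all chosen for the example, and the paper does not prescribe a particular inference method.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]             # topic counts per document
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                            # tokens assigned per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic proportions per document and top words per topic.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [
        sorted(vocab, key=lambda w: -topic_word[k][w])[:3]
        for k in range(num_topics)
    ]
    return theta, top_words


# Hypothetical tokenized corpus, invented for illustration.
docs = [
    "price discount offer price".split(),
    "discount offer sale price".split(),
    "delivery late courier delivery".split(),
    "courier late delivery tracking".split(),
]
theta, top_words = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
```

No labels are supplied at any point: the topics emerge only from word co-occurrence, which is what distinguishes LDA from the NLP-based NEE approach. Tracking how `theta` shifts across batches of documents collected over time is one possible way to look for changes in customer behavior.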
Text analytics is widely used in emerging areas
such as information extraction, topic models and
opinion/sentiment analysis.
When working with text in big data, where exactly
is a text analytics technique applied? Text data are
typically held as notes, documents and various forms
of electronic correspondence (emails, for example).
Structured data on the other hand are usually
contained in databases with fixed structures. Many
data mining techniques have been developed to
extract useful patterns from structured data and this
process is often enhanced by the addition of variables
(called features) which add new ‘dimensions’,
providing information that is not implicitly contained
in existing features. The appropriate processing of text
data can allow such new features to be added,
improving the effectiveness of predictive models or
providing new insights [5].
Consider an example involving customer
information. Structured data about a customer is
stored in a database in fields such as name, salary and
order value, and there is also unstructured text
containing customer information. Neither the
structured data nor the unstructured text carries a
customer ID. By applying a text analytics technique
such as NEE or LDA, one can infer the nature, strength
or absence of relationships among individuals and
derive a new feature such as sentiment. Mining the
enriched data then produces predictive patterns that
can be used in applications to create valuable
information.
Structured data:
Name    Salary    Order
Raj     45,000    3400
Jyoti   65,000    4500
Fig 2. Where to apply text analytics? (Structured and unstructured data feed text analytics, which yields a new feature; data mining on the enriched data then produces patterns.)
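The enrichment step in this example can be sketched as follows. Several things here are assumptions made for illustration: the free-text notes are invented, records are linked to notes by the customer's name because no customer ID exists, and a tiny word-list sentiment scorer stands in for a real technique such as NEE or LDA.

```python
import re

# Structured records from the example (no customer ID available).
customers = [
    {"name": "Raj", "salary": 45000, "order": 3400},
    {"name": "Jyoti", "salary": 65000, "order": 4500},
]

# Hypothetical unstructured notes, invented for illustration.
notes = [
    "Raj complained that the order arrived late and damaged.",
    "Jyoti was happy with the quick delivery and great service.",
]

# Hypothetical sentiment word lists; a real system would use a trained model.
POSITIVE = {"happy", "great", "quick", "good"}
NEGATIVE = {"late", "damaged", "complained", "bad"}


def words(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def sentiment(text):
    """Label a text by comparing counts of positive and negative words."""
    tokens = words(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


def add_sentiment(customers, notes):
    """Attach a sentiment feature to each record using notes that mention its name."""
    enriched = []
    for customer in customers:
        name = customer["name"].lower()
        related = " ".join(n for n in notes if name in words(n))
        enriched.append({**customer, "sentiment": sentiment(related)})
    return enriched


enriched = add_sentiment(customers, notes)
for row in enriched:
    print(row)
```

The enriched records now carry a sentiment column alongside name, salary and order value, and can be passed to a conventional data mining step.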
4. OBSTACLES IN TEXT ANALYTICS
Although the field is new, text analytics is
reaching a level of development that makes its
widespread use possible.
The following are obstacles to the adoption of text
analytics:
1. How to deploy the results?
2. How to handle heterogeneous data?
3. Lack of methods to determine what
exactly is in the text.
Recent technological developments have overcome
these problems.
5. CONCLUSION
The real strength of big data lies in its utilization.
Big data contains a huge amount of text data that can
be used to extract meaningful information. Using big
data to train a classifier and applying NLP together
with text analytics techniques can yield valuable
information from raw text data. This paper has
discussed how NEE, LDA and the term matrix can be used
to extract information from large unstructured text.
REFERENCES
[1] Hsinchun Chen, Roger H. L. Chiang, and Veda C.
Storey (2012). Business intelligence and
analytics: From big data to big impact. MIS
Quarterly, Vol. 36, No. 4, pp. 1165-1188,
December.
[2] https://www.gartner.com/doc/2574616
[3] Salton, G. 1989. Automatic Text Processing,
Reading, MA: Addison Wesley
[4] Manning, C. D., and Schütze, H. 1999.
Foundations of Statistical Natural Language
Processing, Cambridge, MA: The MIT Press.
March, S. T., and Storey, V. C. 2008. “Design
Science in the Information Systems Discipline,”
MIS Quarterly (32:4), pp. 725-730.
[5] http://butleranalytics.com/unstructured-meets-structured-data/
[6] http://www.informs.org/
[7] http://www.twitter.com. Twitter, Inc.
[8] http://www.google.com. Google, Inc.
[9] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent
Dirichlet allocation. J. Mach. Learn. Res., 3:993–
1022, March 2003.
[10] J. Chen, R. Nairn, L. Nelson, M. Bernstein, and
E. Chi. Short and Tweet: Experiments on
Recommending Content from Information
Streams. CHI 2010.
[11] W. Dou, X. Wang, R. Chang, and W. Ribarsky.
ParallelTopics: A Probabilistic Approach to
Exploring Document Collections. In Visual
Analytics Science and Technology (VAST), 2011
IEEE Conference on, 2011.
Fig 2 (continued). Structured data with the new Sentiment feature, from which data mining derives patterns:
Name    Salary    Order    Sentiment
Raj     45,000    3400     Negative
Jyoti   65,000    4500     Positive
[12] Salton, G. & Buckley, C. 1988. Term-weighting
approaches in automatic text retrieval.
Information Processing & Management 24
(5):513-523.
[13] X. Wei and W. B. Croft. LDA-based document
models for ad-hoc retrieval. In Proceedings of the
29th annual international ACM SIGIR conference
on Research and development in information
retrieval, SIGIR ’06, pages 178–185, New York,
NY, USA, 2006. ACM.