The Big Data Paradigm Shift: Insight Through Automation
In this white paper, you will learn about:
• Big Data’s combinatorial explosion
• Current and emerging technologies
• Automation as the new way to leverage insight within Big Data
• An algorithmic approach to the Big Data revolution
CONTENTS
01 / EXECUTIVE SUMMARY
02 / THE BIG DATA PARADIGM SHIFT
03 / LANDSCAPE OF EXISTING METHODS AND TOOLS
04 / EMERGENCE OF BIG DATA TOOLS
05 / THE NEED FOR A NEW BIG DATA ANALYTICS APPROACH
06 / EMCIEN’S ALGORITHMIC APPROACH TO BIG DATA
07 / NOT JUST THEORY: SOLVING REAL-WORLD PROBLEMS
08 / CONCLUSION
“... a study of progress over a 15-year span on a benchmark production-planning task. Over that time, the speed of completing the calculations improved by a factor of 43 million. Of the total, a factor of roughly 1,000 was attributable to faster processor speeds. Yet a factor of 43,000 was due to improvements in the efficiency of software algorithms.”
Martin Grötschel, a German scientist and mathematician, as cited in a White House advisory report, December 2010
Executive Summary

“Data used to be scarce, and tiny bits of it were extremely valuable. Tiny bits of it are still extremely valuable, but now there is so much data that finding the valuable bits can be extremely difficult.”
Big Data promises greater insight, competitive advantage, and the possibility of solving problems that
have yet to be imagined. These insights will come from software applications that automate the analytics
process. While infrastructure may be an important building block, this paper focuses on the algorithms
that will deliver the insights organizations need.
This paper does the following:
• Proposes a paradigm shift away from analysts’ one-to-one relationship with data toward a relationship
with algorithms.
• Explains the combinatorial explosion that makes Big Data Analytics impossible for old data analytics tools
and methodologies.
• Examines current and emerging technologies.
• Suggests that methods which handle Big Data the same way smaller data sets are handled miss the potential of Big Data.
• Proposes that rather than search and crunch data, organizations need the ability to automate the
process of analyzing, visualizing and ultimately leveraging the insight within their data.
• Introduces an algorithmic approach that provides an efficient, sustainable, automated way to delve
into Big Data, detect patterns, and discover insights hidden within that data.
The Big Data Paradigm Shift
The Need for a Paradigm Shift
Humankind has always possessed a love for data.
We can do remarkable things with data and have
built remarkable tools to collect, store, sift, sort,
splice, dice, chart, report, predict, and visualize it.
Data can change the way we perceive the world and
how we interact with it. But the world is changing.
Data used to be scarce, and tiny bits of it were
extremely valuable. Tiny bits of it are still extremely
valuable, but now there is so much data that finding
the valuable bits can be extremely difficult.
Big Data demands that organizations change the
way they interact with data. In the past, analysts
could stand by the data faucet and collect what was
needed in a paper cup, but now data is the ocean
in which they are floating. Paper cups are useless
here. Why? Because the search for data is over.
Data is everywhere. It creates noise. Now analysts
are searching for the signal amidst the noise. They
are looking for the important bits. And in this vast
sea of information, that task is overwhelming.
Current tools and methodologies are failing when it
comes to finding the most critical information in a
time-sensitive, cost-effective manner. The reason?
Many of these emerging tools and technologies are
trying to approach this challenge with new ways of
doing the same old things. But a bigger cup is not
what’s needed. That won’t solve it. What is needed
is a completely new approach to Big Data Analytics.
In this new approach, the only way to find the
signal is to automate the process of data-to-insight
conversion. And automation requires algorithms—
fast, sophisticated, highly optimized algorithms.
The volume of Big Data demands a change in the human relationship with data: the work must shift from human to machine. The algorithms have to do the work, not the humans. In this brave new world, the machines and algorithms are the protagonists. The role of the analysts will be to select the best algorithms and approve the results based on speed, quality, and economics.
The Promise of Data and the
Search for Insight
Why is the world obsessed with data? Because
the promise of data is insight. In the last few
years, organizations have become exceptionally
good at collecting data, and as the cost of storage
has dropped, companies are now drowning
in that “Big Data.” However, the business
world has hit a wall where the amount of data available far exceeds the human capacity to process it. The amount of data also exceeds the capabilities of existing analytics and intelligence tools, which have served as mere data-shovels or pickaxes in the search for the gold that is insight.
Acquiring insight inevitably involves querying a
database, or, most likely, several databases. Many
analysts are mashing up data across data silos in an
attempt to discover the connections between data
points. For example, marketers are aggregating
customer demographics, purchase data and social
media data, while purchasers are aggregating
supplier data with procurement and pricing data.
This process produces a variety of data sets of
different types and qualities.
A simple query, for example, might ask for specific
values within a subset of columns. So the real
question becomes, “How many queries will it take
to answer even one of these questions?” Consider
how many queries one might make into even a
small set of data, such as a table containing just
ten columns:
• If each column has 2 possible values, there are
59,048 possible queries.
• If each column has 3 possible values, there are
1,048,575 possible queries.
To think of it another way, a database with 100 columns
and 6 choices per column yields more possible queries
than there are atoms in the universe.
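One way to arrive at the figures above: if each of the c columns is either left unconstrained or fixed to one of its v possible values, there are (v + 1)^c combinations, and dropping the empty query leaves (v + 1)^c − 1. A short Python sketch reproduces the numbers cited:

```python
# Count the possible "specific values within a subset of columns" queries:
# each column is either left unconstrained or fixed to one of its v values,
# and subtracting 1 drops the empty (no-constraint) query.
def possible_queries(columns: int, values_per_column: int) -> int:
    return (values_per_column + 1) ** columns - 1

print(possible_queries(10, 2))            # 59048
print(possible_queries(10, 3))            # 1048575
print(f"{possible_queries(100, 6):.2e}")  # ~3.2e+84, more than the ~1e80 atoms in the universe
```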
The Limitations of Search and Query-based Approaches
Data becomes unwieldy due to the number of rows
or the number of columns, or both. Data mash-ups,
described previously, create a lot of columns. High
volume transactional systems have lots of rows or
records. However, having millions of records is not
the problem. The depth of the data—the number of
rows—merely impacts processing time in a linear
fashion and can be reduced with fast or parallel
computing. Thus, executing a query is simple
enough. The problem, however, is the width of the
data because the query explodes exponentially
based on the number of columns.
As a result, the real task of extracting insight from data is formulating the right queries. And manually laboring through thousands of queries to find the ones that deliver insight is not an efficient way to derive value from data. Therefore, when it comes to Big Data, the big challenge is knowing the right query.

[Figure: Exponential explosion of queries. The number of possible queries grows exponentially with the number of values per column, climbing into the tens of billions.]
Landscape of Existing Methods and Tools
Over 85% of all data is unstructured.[1] However, existing methods and tools are designed to analyze structured data. A high-level categorization of analytics tools is critical to understanding the state of Big Data Analytics.
Statistical Tool Kits
The purpose of statistical analysis is to make
inferences from samples of data, especially when
data is scarce. In the era of Big Data, scarcity is not
the problem. Traditional statistical methods have
severe limitations in the realm of Big Data for the
following reasons:
• Statistical methods break down as dimensionality increases (see the sketch following this list).
• In unstructured data, dimensions are not well defined.
• Attempts to define dimensions for unstructured data result in millions of dimensions.
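To make the first bullet concrete, here is a small illustration (not from the paper) of why distance-based statistics degrade in high dimensions: as dimensionality grows, the pairwise distances between random points concentrate around a common value, so "near" and "far" become nearly indistinguishable.

```python
# Distance concentration: as the number of dimensions grows, the spread of
# pairwise distances shrinks relative to their mean.
import numpy as np

rng = np.random.default_rng(0)
for dims in (2, 10, 100, 1000):
    points = rng.random((500, dims))           # 500 random points in the unit cube
    a, b = points[:250], points[250:]
    distances = np.linalg.norm(a - b, axis=1)  # 250 pairwise distances
    print(f"{dims:>5} dims: relative spread = {distances.std() / distances.mean():.3f}")
```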
Data Mining
Data mining is a catchall phrase for a very broad category of methods. Essentially, it is a way of sifting through very large amounts of data in an attempt to find useful information. It implies “digging through tons of data” to uncover patterns and relationships contained within the business activity and history. Data mining involves either manually slicing and dicing the data until a pattern becomes obvious or using software that analyzes the data automatically.
The first limitation of data mining is that the data
has to be put in a structured format first, such as a
database. The second limitation is that most forms
of data mining require that the analyst knows
what to look for. For example, in classification and
clustering analysis, the analyst is trying to find
instances of known categories, such as people
who have a high probability of defaulting on their
mortgages. In anomaly detection, the analyst is
looking for instances that do not match the known
normal patterns or known suspicious patterns, such
as people who pay cash for one-way plane tickets.
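As a hypothetical illustration of that dependence on known categories (the feature names and data below are invented for the example), a supervised model can only flag "likely defaulters" after the analyst supplies labeled examples of defaulting:

```python
# The analyst must define the category up front by providing labeled examples;
# the model then scores new records against that known category.
from sklearn.linear_model import LogisticRegression

# toy features: [debt_to_income, late_payments]; labels: 1 = defaulted, 0 = did not
X = [[0.1, 0], [0.2, 1], [0.8, 4], [0.9, 6], [0.3, 1], [0.7, 5]]
y = [0, 0, 1, 1, 0, 1]

model = LogisticRegression().fit(X, y)
print(model.predict([[0.75, 3]]))  # scores a new applicant against the known category
```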
“The overwhelming shortcoming of all these methods is that they are query-based and labor-intensive.”
Data Visualization
Data visualization is the study of the visual representation
of data, meaning information that has been abstracted
in some schematic form, including attributes or variables
for the units of information.[2] Humans are better
equipped to consume visual data than text. As we know,
a picture is worth a thousand words.
While visualization tools are interesting, they rely on
human evaluation to extract insight and knowledge. The
more severe limitation of visualization is that the visuals
can only focus on two or three dimensions at the most
before the amount of information becomes overwhelming. The practical consequence is that while visualization is a good check on small samples, it is not a sustainable method for gaining insight into large volumes of high-dimensionality data.
Consider a scenario in which there aren’t enough pixels on the screen to represent each item. An analyst can easily inspect a friendship network with 10-100 people, but not one with a billion.
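A rough back-of-the-envelope check, assuming a standard 4K display, shows how quickly the pixel budget runs out:

```python
# Even at one pixel per node, a 4K display falls several orders of magnitude
# short of a billion-node network.
pixels_4k = 3840 * 2160           # ~8.3 million pixels
nodes = 1_000_000_000             # a billion-person friendship network
print(pixels_4k)                  # 8294400
print(nodes / pixels_4k)          # ~120 nodes competing for every pixel
```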
Business Intelligence & Analytics
Business Intelligence (BI) is a catchall phrase for ad
hoc reports created in a database. These are typically
pre-canned reports based on metrics that users
are comfortable reporting. “Analytics” includes any
computation performed for reporting. Hence, BI tools
are now called analytics. BI was created as a way to
extract data from the database. While it continues to
serve that purpose, it is time and labor intensive and is
not intended to surface insights.
Limitations of Existing Tools
The overwhelming shortcoming of all these methods
is that they are query-based and labor-intensive.
Big Data presents a virtually infinite number of possible queries, so all these methods rely on analysts to produce the right questions. Any method that puts this burden on the user is a non-starter.
Although search remains the go-to information access
interface, reliance on search needs to end. Search is
not enough. A new type of information-processing
focus is needed.
The major shortcomings of the existing tools are as follows:
• Search only helps you find things you already know to look for; it cannot surface insights about which you’re completely unaware.
• Query-based tools are time-consuming because search-based
approaches require a virtually infinite number of queries.
• Statistical methods are largely limited to numerical data; over
85% of data is unstructured.
Emergence of Big Data Tools
Because Big Data includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time, new technologies are emerging to address the challenges brought on by these large quantities of data. These technologies can be categorized into two groups: Hadoop-based solutions and in-memory solutions.
Hadoop and Hadoop-based Tools
While Hadoop is not an analytics tool per se, it is often mistaken for one. Apache Hadoop is
an open-source software framework that supports
data-intensive distributed applications. It supports
the running of applications on large clusters of
commodity hardware.
Hadoop is used to break big tasks into smaller
ones so that they can be run in parallel to gain
speed and efficiency. This is great for a query on
a large volume data set. The data set can be cut
into smaller pieces, and the same query can be run
on each smaller set. Hadoop aims to lower costs
by storing data in chunks across many inexpensive
servers and storage systems. The software can
help speed up certain types of simple calculations
by sending many queries to multiple machines at
the same time. The technology has spawned a set
of new start-ups, such as Hortonworks Inc. and
Cloudera Inc., which help companies implement it.
Hadoop helps companies store large amounts of
data but doesn’t provide critical insights based on
the naturally occurring connections within the data.
The impulse to store lots of data because it is cheap
to do so can lead to storing too much data, which
can make answering simple questions more difficult.
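As a loose local sketch of the split-and-parallelize idea described above (this mimics the map/reduce pattern with Python's multiprocessing; it does not use Hadoop's actual APIs, and the "failed login" query is invented for the example):

```python
# Cut the data set into chunks, run the same simple query on each chunk in
# parallel, then combine the partial results.
from multiprocessing import Pool

def count_failed_logins(chunk):
    # the "query": count records matching a condition within one chunk
    return sum(1 for record in chunk if record.get("event") == "failed_login")

if __name__ == "__main__":
    records = [{"event": "failed_login"} if i % 7 == 0 else {"event": "ok"}
               for i in range(100_000)]
    chunks = [records[i::4] for i in range(4)]    # split into 4 pieces
    with Pool(processes=4) as pool:
        partial_counts = pool.map(count_failed_logins, chunks)
    print(sum(partial_counts))                    # combine the partial results
```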
IT professionals and analysts are asking the
following questions:
• Where is the insight? What is the data telling us?
• How can I prove the return on investment?
As a result, in-memory databases are gaining attention in an attempt to come closer to the goal of real-time business processing.[3,4]
In-Memory-based Appliance
Some of these approaches have been around for a
long time in areas such as telecommunications or
fields related to embedded databases. An example
is SAP’s HANA (High Performance Analytical
Appliance). This in-memory paradigm is now touted
as the future database paradigm for Big Data.
The primary limitations of in-memory computing are cost and size. There is a significant limit to the amount of
data that can be held in memory. If you need to
perform Big Data–style analysis and you want to
see the bigger picture, in-memory is not enough.
The cost is prohibitive and it is not sustainable.
The Need for a New Big Data Analytics Approach
While these emerging technologies are attempting to address the challenge of Big Data, at the end of the day,
they are heavy-handed and time-consuming because they lack automated intelligence for gaining insight. It’s
time for an entirely new approach.
This new approach demands a paradigm shift that focuses on the following:
• A fundamental change in the role played by analysts from data-miners to insight-evaluators.
• Fast and efficient algorithms that automatically convert data to insight for evaluation.
• Continual improvement of these algorithms to keep up with the speed of data and critical need for
timely insights.
Old Paradigm → New Paradigm:
• Data Analyst Digs for Insights by Manually Querying a Database → Algorithms Automatically Surface Insights to Evaluate
• Analysis Takes From Months to Years → Automatic Insights in Seconds to Minutes
• Specialized Skills in Math and Computer Science Required → Anyone (No Specialized Skills Required)
• Operational and Business Intelligence → Immediate Insight and Perspective
Emergence of Algorithms as a New Class of Big Data Software Tools
The size and speed of Big Data demands true automation,
in which work is offloaded from human to machine. This
automation happens with algorithms, which are designed
for calculation, data processing, and automated reasoning.
Algorithms are designed for tasks that are beyond human
comprehension and require the speed of machines. This is
the realm of Big Data.
One of the most dramatic and game-changing examples of
an algorithm was designed by Alan Turing to automatically
decode German Navy messages at Bletchley Park during
WWII. In this instance, the urgency was critical and
demanded an automated approach to convert the data
to intelligence. There were 158 million million million
(158,000,000,000,000,000,000) possible ways that a
message could be coded by the German Enigma machine.
Turing’s decryption method, applied to the naval Enigma traffic that Bletchley Park code-named “Shark,” worked, and the results changed the course of the war. The Allies won because they had a competitive advantage. Bringing it to the present,
the use of Big Data in the 2012 United States presidential
election changed the face of political campaigns forever.
Emcien’s approach to Big Data is to automate the process
of data-to-insight in a timely and cost effective manner
through sophisticated algorithms. The algorithms leverage
advanced mathematics to solve complex problems of
an unimaginable size, thereby pushing the frontier of
innovation and competition. The following section details
Emcien’s algorithmic approach to Big Data Analytics.
Emcien’s Algorithmic Approach to Big Data
Rather than search and crunch data, organizations need the ability to analyze, visualize and ultimately leverage
the patterns and connections within their data. Emcien’s innovation is a suite of automatic pattern detection
algorithms. These algorithms utilize a graph data model that captures the interconnectedness of the data
elements and creates a very elegant representation of high volume data with unknown structure. These fast,
sophisticated algorithms automatically detect patterns and self-organize what they find, thereby providing immediate insight and perspective.
Here is an outline of how the algorithm works (a brief code sketch follows the list):
1. Assesses the data in order to identify and measure connections between data points.
2. Converts the original high density/low value (structured, semi-structured or unstructured) data into a low
density/high value graph.
Structured
If the data is structured:
• Each row in the data table is considered an event.
• Each cell in the row is converted to a node in the graph.
• Cells that co-occur in a row (event) are connected by an arc.
Unstructured
If the data is unstructured:
• Every word or data element is converted to a node.
• Two words or data elements that occur simultaneously in an event are
connected by an arc.
• An event may be defined as a single document, message, email exchange, etc.
3. Builds the graph on non-Euclidean distances. This is important, as most of the data is unstructured and non-numeric, so the distances and strengths are computed in non-Euclidean space. (For example, you may be “closer” to your family than to your friends, but that closeness is not a Euclidean distance.)
4. Computes millions of data points across the graphs to enable patterns to emerge.
5. Distills the noise, to allow the signal to emerge. This is made possible because the noise has patterns and
the algorithms are designed to detect these patterns.
6. Enables the key topographical elements of the graph to emerge. The algorithm then ranks these elements and focuses on the most significant ones.
7. Categorizes these elements based on the application and outputs the insight.
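A minimal sketch of steps 1 and 2, assuming tabular rows as events; this illustrates the co-occurrence idea and is not Emcien's actual implementation:

```python
# Treat each row as an event, turn each cell value into a node, and connect
# values that co-occur in the same event with a weighted arc (the weight
# counts how often the pair co-occurs).
from collections import Counter
from itertools import combinations

def build_cooccurrence_graph(events):
    """events: iterable of iterables of hashable data elements (cells or words)."""
    nodes = Counter()
    arcs = Counter()
    for event in events:
        elements = set(event)                      # distinct elements in this event
        nodes.update(elements)
        for a, b in combinations(sorted(elements, key=str), 2):
            arcs[(a, b)] += 1                      # strengthen the arc on each co-occurrence
    return nodes, arcs

rows = [
    ("atlanta", "visa", "electronics"),
    ("atlanta", "visa", "groceries"),
    ("boston", "amex", "electronics"),
]
nodes, arcs = build_cooccurrence_graph(rows)
print(arcs[("atlanta", "visa")])   # 2 -- the strongest connection in this toy data
```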
Understanding the Make-up of Graphs Based on Connections
The graph data model is very flexible and displays a distinct topography based on the density of connections. A cross-sectional view of the graph data model will typically expose the following layers:

Layer 1 (Very Noisy Connections): Typically the most highly prevalent data, this layer is composed of high-volume interactions that may be mundane and blatantly obvious.
Layer 2 (Highly Connected Nodes): Lying just below the noise, this second layer is composed of the first signal that is interesting. This layer exhibits distinct patterns based on crowd behavior.
Layer 3 (Weaker Connections): The third layer is a weaker signal and displays the non-obvious connections. These relate to events that are less frequent and may be connected in non-obvious ways.
Layer 4 (The Faint Signal): Composed of very weak connections and interactions, this last layer is of interest for security and surveillance. In many cases, this layer only emerges when the data is very rich in entities, causing connections to emerge in very non-obvious ways.
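Continuing the sketch above, one naive way to approximate this layered view is to bucket weighted arcs by how their weights compare to the strongest connection; the thresholds below are arbitrary placeholders, not Emcien's ranking logic:

```python
# Illustrative only: bucket weighted arcs into rough "layers" relative to the
# heaviest connection. The cut-off fractions are arbitrary placeholders.
def layer_arcs(arcs):
    """arcs: dict mapping (node_a, node_b) -> co-occurrence weight."""
    top = max(arcs.values())
    layers = {1: [], 2: [], 3: [], 4: []}
    for pair, weight in arcs.items():
        if weight >= 0.5 * top:      # layer 1: very noisy, highly prevalent connections
            layers[1].append(pair)
        elif weight >= 0.1 * top:    # layer 2: highly connected, crowd-behavior signal
            layers[2].append(pair)
        elif weight >= 0.02 * top:   # layer 3: weaker, less frequent connections
            layers[3].append(pair)
        else:                        # layer 4: the faint signal
            layers[4].append(pair)
    return layers

toy_arcs = {("a", "b"): 120, ("a", "c"): 35, ("b", "c"): 6, ("c", "d"): 1}
print(layer_arcs(toy_arcs))
```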
Advantages of Emcien’s Graph Data Model
The graph data model exhibits a topography that signifies relationships and connectedness in a way that is not
possible through any other method. Emcien’s algorithms have been designed to surface these patterns. Listed
below are a few key attributes that help describe the characteristics of the algorithms.
Software: A critical distinction of Emcien’s graph data model is that it is software. Emcien’s software provides the computational engine with a data representation that lends itself to high-speed computing. As a result, the software runs on typical commodity computing environments.

Algorithmic Layer: Although some products on the market model the graph into the database layer or hardware layer, they do not have an algorithmic layer and, therefore, require the user to query the systems based on the old data-inquiring paradigm. Algorithms automate the data analysis process, which is an absolute requirement for efficient Big Data analytics.

Compact Representation of Data: Data is big because the number of events can grow exponentially as the various entities are continually interacting. The graph representation is ideal for Big Data because it creates a very compact representation of the data. This is because the number of entities grows more slowly and reaches a natural steady state. In the graph data model these interactions translate to connection weights, allowing the graph model to encapsulate very big data in smaller structures.

Noise Elimination: The graph data model can be thought of in terms of layers, based on the connectedness of the data elements. The highly connected and noisy nodes are at the top layer, and the weak connections lie buried deep in the graphs. The noisy connections can be overwhelming and tend to render graph models burdensome. Emcien utilizes a suite of patented algorithms to automatically distill the noise and detect critical patterns that relate to highly significant and relevant information.
Not Just Theory: Solving Real-World Problems
The representation and visualization of complex networks as graphs helps surface critical, time-sensitive
intelligence. One of the most important tasks in graph analysis is to identify closely connected network
components comprising nodes that share similar properties. Detecting communities is of significant value
in retail, healthcare, banking and intelligence work - verticals where loosely federated communities deliver
insight and intelligence into the profile of a customer base or any other group being analyzed.
How Can This Model Be Applied? Emcien’s Pattern Detection Engine:
• Intelligence: Surfaces critical correlations
between people that merit serious attention,
determines key individuals in targeted social
networks, and geo-locates persons of interest
and their networks around the world – from
gangs to terrorists.
• Network Security: Auto-detects intrusion patterns
and surfaces suspicious activity by providing
immediate insight into highly linked variables. It
then automatically identifies anomalies without
the user having to query the data. For example,
Emcien analyzes millions to billions of transactions
to identify patterns in source and destination IP addresses, ports, days, times and activity – to
show you what you should be paying attention
to. Emcien eliminates over 95% of the noise and
identifies patterns that are “surprising” or that
deviate from the norm.
• Fraud Detection: Surfaces patterns in money
laundering and fraud by identifying groups of
customers, locations, or transaction types that
occur together in banking transactions.
• Customer Analytics: Surfaces insights on customer
buying patterns, locations, demographics, loyalty,
savings, lifestyle and insurance.
• Healthcare Analytics: Analyzes massive volumes
of clinical data on medications, allergies, medical
claims, pharmacy therapies, lab results, medical
records, clinician notes and more in order to
surface patterns.
• Performance and Operations Analytics: Analyzes
raw information about performance and operations
of every element of an organization, which can be
interpreted to increase profitability or improve
customer service.
In short, Emcien tackles one of the biggest challenges with Big Data, namely “What are the right questions
to ask?” Emcien’s pattern-detection engine quickly discovers the value within massive data sets by making
connections between disparate, seemingly unrelated bits of information and by finding the highest-ranked of
these connections to focus on – which reveals time-sensitive, mission-critical insights.
Conclusion
The Big Data Analytics revolution is underway. This
revolution is a historic and game-changing expansion of
the role that information plays in business, government
and consumer realms. To harness the power of this data
revolution, a paradigm shift is required. Organizations
must be able to do more than query their Big Data
stores; search is no longer enough.
Up until now in the history of data analysis, the objective
of queries was to find the signal in the noise. And it
worked because we had clear-cut business questions
and the size of the data was smaller, the data set was
more complete, and we usually knew what we were
looking for. We were playing in the realm of known
knowns and known unknowns. In the new world of Big
Data, it is now more important to know what to ignore.
Because unless you know what to ignore, you’ll never
get a chance to pay attention to what’s really important.
Using algorithms to first ignore the noise and then find
the insights is the way of the new world.
Extracting insight from Big Data requires analytics
methods that are fundamentally different from
traditional querying, mining, and statistical analysis
on small samples. Big Data is often noisy, dynamic,
heterogeneous, unstructured, inter-related and
untrustworthy.
The combinatorial explosion requires new methods
for finding insight in Big Data. The need for data
sophistication is due to economics and time-criticality.
As stated, manually laboring through thousands of
queries to find the ones that deliver insight is not an
efficient way to derive value from data.
Emcien’s technology provides a “Command Center”
for Big Data, automatically interpreting the data,
discovering patterns, identifying complex and significant
relationships, and surfacing the most relevant questions
that lead to the insights analysts need to know.
About Emcien Corp.
Emcien’s automatic pattern-detection engine converts data to actionable insight that organizations can use
immediately. Emcien breaks through time, cost and scale barriers that limit the ability to operationalize the
value of data for mission-critical applications. Our patented algorithms recognize what’s important, defocus
what’s not, evaluate all possible combinations and deliver the optimal results automatically. Emcien’s engine,
fueled by several highly competitive NSF grants and years of research at Georgia Tech and MIT, is delivering
unprecedented value to organizations across sectors that depend on immediate insight for success—banking,
healthcare, insurance, retail, Intelligence and others. Visit emcien.com to learn more.
Sources
1. Christopher C. Shilakes and Julie Tylman, “Enterprise Information Portals,” Merrill Lynch, 16 November 1998.
2. Michael Friendly, “Milestones in the history of thematic cartography, statistical graphics, and data visualization,” 2008.
3. J. Vascellaro, “Hadoop Has Promise but Also Problems,” The Wall Street Journal, 23 February 2012.
4. R. Srinivasan, “Enterprise Hadoop: Five Issues With Hadoop That Need Addressing,” blog, 28 May 2012.