The Big Data Paradigm Shift: Insight Through Automation
In this white paper, you will learn about:
•	Big Data’s combinatorial explosion
•	Current and emerging technologies
•	Automation as the new way to leverage insight within Big Data
•	An algorithmic approach to the Big Data revolution

CONTENTS
01 / EXECUTIVE SUMMARY
02 / THE BIG DATA PARADIGM SHIFT
03 / LANDSCAPE OF EXISTING METHODS AND TOOLS
04 / EMERGENCE OF BIG DATA TOOLS
05 / THE NEED FOR A NEW BIG DATA ANALYTICS APPROACH
06 / EMCIEN’S ALGORITHMIC APPROACH TO BIG DATA
07 / NOT JUST THEORY: SOLVING REAL-WORLD PROBLEMS
08 / CONCLUSION

“... a study of progress over a 15-year span on a benchmark production-planning task. Over that time, the speed of completing the calculations improved by a factor of 43 million. Of the total, a factor of roughly 1,000 was attributable to faster processor speeds. Yet a factor of 43,000 was due to improvements in the efficiency of software algorithms.”
Martin Grötschel, a German scientist and mathematician, as cited in a White House advisory report, December 2010

“Data used to be scarce, and tiny bits of it were extremely valuable. Tiny bits of it are still extremely valuable, but now there is so much data that finding the valuable bits can be extremely difficult.”

Executive Summary
Big Data promises greater insight, competitive advantage, and the possibility of solving problems that
have yet to be imagined. These insights will come from software applications that automate the analytics
process. While infrastructure may be an important building block, this paper focuses on the algorithms
that will deliver the insights organizations need.
This paper does the following:
•	 Proposes a paradigm shift away from analysts’ one-to-one relationship with data toward a relationship with algorithms.
•	 Explains the combinatorial explosion that puts Big Data Analytics beyond the reach of old data analytics tools and methodologies.
•	 Examines current and emerging technologies.
•	 Suggests that methods which handle Big Data the same way smaller data sets have always been handled miss the potential of Big Data.
•	 Proposes that rather than search and crunch data, organizations need the ability to automate the process of analyzing, visualizing and ultimately leveraging the insight within their data.
•	 Introduces an algorithmic approach that provides an efficient, sustainable, automated way to delve into Big Data, detect patterns, and discover insights hidden within that data.

The Big Data Paradigm Shift
The Need for a Paradigm Shift
Humankind has always possessed a love for data.
We can do remarkable things with data and have
built remarkable tools to collect, store, sift, sort,
splice, dice, chart, report, predict, and visualize it.
Data can change the way we perceive the world and
how we interact with it. But the world is changing.
Data used to be scarce, and tiny bits of it were
extremely valuable. Tiny bits of it are still extremely
valuable, but now there is so much data that finding
the valuable bits can be extremely difficult.
Big Data demands that organizations change the
way they interact with data. In the past, analysts
could stand by the data faucet and collect what was
needed in a paper cup, but now data is the ocean
in which they are floating. Paper cups are useless
here. Why? Because the search for data is over.
Data is everywhere. It creates noise. Now analysts
are searching for the signal amidst the noise. They
are looking for the important bits. And in this vast
sea of information, that task is overwhelming.
Current tools and methodologies are failing when it
comes to finding the most critical information in a
time-sensitive, cost-effective manner. The reason?
Many of these emerging tools and technologies are
trying to approach this challenge with new ways of
doing the same old things. But a bigger cup is not
what’s needed. That won’t solve it. What is needed
is a completely new approach to Big Data Analytics.
In this new approach, the only way to find the
signal is to automate the process of data-to-insight
conversion. And automation requires algorithms—
fast, sophisticated, highly optimized algorithms.
The volume of Big Data demands a change in the
human relationship with data, from human to
machine. The algorithms have to do the work, not
the humans. In this brave new world, the machines
and algorithms are the protagonists. The role of
the analysts will be to select the best algorithms
and approve the quality of results based on speed,
quality and economics.


The Promise of Data and the
Search for Insight
Why is the world obsessed with data? Because
the promise of data is insight. In the last few
years, organizations have become exceptionally
good at collecting data, and as the cost of storage
has dropped, companies are now drowning
in that “Big Data.” However, the business world has hit a wall where the amount of data available far exceeds the human capacity to process it. The amount of data also exceeds the capabilities of existing analytics and intelligence tools, which have served as mere data-shovels or pickaxes in the search for the gold that is insight.
Acquiring insight inevitably involves querying a
database, or, most likely, several databases. Many
analysts are mashing up data across data silos in an
attempt to discover the connections between data
points. For example, marketers are aggregating
customer demographics, purchase data and social
media data, while purchasers are aggregating
supplier data with procurement and pricing data.
This process produces a variety of data sets of
different types and qualities.
A simple query, for example, might ask for specific
values within a subset of columns. So the real
question becomes, “How many queries will it take
to answer even one of these questions?” Consider
how many queries one might make into even a
small set of data, such as a table containing just
ten columns:
•	 If each column has 2 possible values, there are 59,048 possible queries.
•	 If each column has 3 possible values, there are 1,048,575 possible queries.
To think of it another way, a database with 100 columns and 6 choices per column yields more possible queries than there are atoms in the universe.
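To make the arithmetic concrete, here is a minimal sketch. It assumes the counting model implied by the numbers above: each column is either fixed to one of its values or left unconstrained, and the single query that constrains nothing is excluded.

```python
# Sanity check of the query counts above. With n columns and v possible
# values per column, each column is either fixed to one of its v values
# or left unconstrained, giving (v + 1) ** n combinations; removing the
# one query that constrains nothing leaves (v + 1) ** n - 1 queries.

def possible_queries(columns: int, values_per_column: int) -> int:
    return (values_per_column + 1) ** columns - 1

print(possible_queries(10, 2))            # 59,048
print(possible_queries(10, 3))            # 1,048,575
print(f"{possible_queries(100, 6):.2e}")  # about 3.2e84, versus roughly 1e80 atoms in the universe
```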

The Limitations of Search and
Query-based Approaches
Data becomes unwieldy due to the number of rows
or the number of columns, or both. Data mash-ups,
described previously, create a lot of columns. High
volume transactional systems have lots of rows or
records. However, having millions of records is not
the problem. The depth of the data—the number of
rows—merely impacts processing time in a linear
fashion and can be reduced with fast or parallel
computing. Thus, executing a query is simple
enough. The problem, however, is the width of the data, because the number of possible queries explodes exponentially with the number of columns.

As a result, the real task of extracting insight from
data is formulating the right queries. And manually
laboring through thousands of queries to find the
ones that deliver insight is not an efficient way to
derive value from data. Therefore, when it comes to
Big Data, the big challenge is knowing the right query.

[Figure: Exponential explosion of queries. The number of possible queries (y-axis, reaching into the tens of billions) grows exponentially with the number of values per column (x-axis, 0 to 14).]

Landscape of Existing Methods and Tools
Over 85% of all data is unstructured.¹ However,
existing methods and tools are designed to analyze
structured data. A high level categorization of
analytics tools is critical to understanding the state
of Big Data Analytics.

Statistical Tool Kits
The purpose of statistical analysis is to make
inferences from samples of data, especially when
data is scarce. In the era of Big Data, scarcity is not
the problem. Traditional statistical methods have
severe limitations in the realm of Big Data for the
following reasons:
•	 Statistical methods break down as dimensionality increases (illustrated in the sketch below this list).
•	 In unstructured data, dimensions are not well defined.
•	 Attempts to define dimensions for unstructured data result in millions of dimensions.
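The first point is easy to see with a back-of-the-envelope computation. The sketch below illustrates the general curse of dimensionality, not any specific method discussed in this paper: keeping a fixed sampling density requires exponentially more samples as dimensions grow.

```python
# Curse of dimensionality in one loop: keeping just 10 distinct bins per
# axis, the number of cells to populate (and hence the samples needed
# for a fixed density) grows as 10**d with the number of dimensions d.
for d in (1, 3, 10, 100):
    print(f"{d:>3} dimensions -> {10 ** d:.1e} cells to cover")
```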

Data Mining
Data mining is a catchall phrase for a very broad
category of methods. Essentially, it is a method for sifting through very large amounts of data in an attempt to find useful information. It implies “digging through tons of data” to uncover patterns and relationships contained within the business activity and history. Data mining involves either manually slicing and dicing the data until a pattern becomes obvious or using software that analyzes the data automatically.
The first limitation of data mining is that the data
has to be put in a structured format first, such as a
database. The second limitation is that most forms
of data mining require that the analyst knows
what to look for. For example, in classification and
clustering analysis, the analyst is trying to find
instances of known categories, such as people
who have a high probability of defaulting on their
mortgages. In anomaly detection, the analyst is
looking for instances that do not match the known
normal patterns or known suspicious patterns, such
as people who pay cash for one-way plane tickets.

“The overwhelming shortcoming of all these methods is that they are query-based and labor-intensive.”

Data Visualization
Data visualization is the study of the visual representation
of data, meaning information that has been abstracted
in some schematic form, including attributes or variables
for the units of information.² Humans are better
equipped to consume visual data than text. As we know,
a picture is worth a thousand words.
While visualization tools are interesting, they rely on human evaluation to extract insight and knowledge. A more severe limitation of visualization is that the visuals can focus on two or three dimensions at most before the amount of information becomes overwhelming. The practical consequence is that while visualization is a good test for small samples, it is not a sustainable method for gaining insight into large volumes of higher-dimensionality data.
Consider a scenario in which there aren’t enough pixels on the screen to represent each item. An analyst can easily inspect a friendship network with 10-100 people, but not one with a billion.

Business Intelligence & Analytics
Business Intelligence (BI) is a catchall phrase for ad hoc reports created from a database. These are typically pre-canned reports based on metrics that users are comfortable reporting. “Analytics” includes any computation performed for reporting; hence, BI tools are now called analytics. BI was created as a way to extract data from the database. While it continues to serve that purpose, it is time- and labor-intensive and is not intended to surface insights.

Limitations of Existing Tools
The overwhelming shortcoming of all these methods
is that they are query-based and labor-intensive.
Big Data offers a virtually infinite number of possible queries, yet all these methods rely on analysts to produce the questions. Any method that puts that burden on the user is a game-stopper.
Although search remains the go-to information access
interface, reliance on search needs to end. Search is
not enough. A new type of information-processing
focus is needed.

The major shortcomings of the existing tools are as follows:
•	Search helps you find things you already know about; it doesn’t help you discover things about which you’re completely unaware.
•	Query-based tools are time-consuming because search-based approaches require a virtually infinite number of queries.
•	Statistical methods are largely limited to numerical data, yet over 85% of data is unstructured.

Emergence of Big Data Tools
Because Big Data includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time, new technologies are emerging to address the challenges brought on by these large quantities of data. These technologies can be categorized into two groups: Hadoop-based solutions and in-memory solutions.

Hadoop and Hadoop-based Tools
While Hadoop is not an analytics tool per se, it is often mistaken for one. Apache Hadoop is an open-source software framework that supports data-intensive distributed applications running on large clusters of commodity hardware.
Hadoop is used to break big tasks into smaller
ones so that they can be run in parallel to gain
speed and efficiency. This is great for a query on
a large-volume data set. The data set can be cut
into smaller pieces, and the same query can be run
on each smaller set. Hadoop aims to lower costs
by storing data in chunks across many inexpensive
servers and storage systems. The software can
help speed up certain types of simple calculations
by sending many queries to multiple machines at
the same time. The technology has spawned a set
of new start-ups, such as Hortonworks Inc. and
Cloudera Inc., which help companies implement it.
Hadoop helps companies store large amounts of
data but doesn’t provide critical insights based on
the naturally occurring connections within the data.
The impulse to store lots of data because it is cheap
to do so can lead to storing too much data, which
can make answering simple questions more difficult.
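As a rough illustration of this split-and-parallelize idea, the sketch below maps the same query over chunks of a data set and combines the partial results. It is an illustration of the concept only, not Hadoop itself; the records and the query predicate are hypothetical.

```python
# A minimal illustration of the MapReduce-style idea described above:
# cut the data into chunks, run the same query on every chunk in
# parallel, then combine the partial results. Not Hadoop itself; the
# records and the predicate are hypothetical.
from concurrent.futures import ProcessPoolExecutor

def run_query(chunk):
    # The "query": count the records matching a simple predicate.
    return sum(1 for record in chunk if record["amount"] > 1000)

def parallel_query(records, workers=4):
    size = max(1, len(records) // workers)
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(run_query, chunks))  # combine the partial counts

if __name__ == "__main__":
    data = [{"amount": a} for a in range(2000)]
    print(parallel_query(data))  # 999 records with amount > 1000
```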


IT professionals and analysts are asking the following questions:
•	 Where is the insight? What is the data telling us?
•	 How can I prove the return on investment?
As a result, in-memory databases are gaining attention in an attempt to come closer to the goal of real-time business processing.³,⁴

In-Memory-based Appliance
Some of these approaches have been around for a
long time in areas such as telecommunications or
fields related to embedded databases. An example
is SAP’s HANA (High Performance Analytical
Appliance). This in-memory paradigm is now touted
as the future database paradigm for Big Data.
The primary limitations of in-memory computing are cost and size. There is a significant limit to the amount of data that can be held in memory. If you need to perform Big Data–style analysis and want to see the bigger picture, in-memory is not enough: the cost is prohibitive, and the approach does not scale sustainably.

The Need for a New Big Data Analytics Approach
While these emerging technologies are attempting to address the challenge of Big Data, at the end of the day,
they are heavy-handed and time-consuming because they lack automated intelligence for gaining insight. It’s
time for an entirely new approach.
This new approach demands a paradigm shift that focuses on the following:
•	 A fundamental change in the role played by analysts from data-miners to insight-evaluators.
•	 Fast and efficient algorithms that automatically convert data to insight for evaluation.
•	 Continual improvement of these algorithms to keep up with the speed of data and the critical need for timely insights.

Old Paradigm → New Paradigm:
•	Data analyst digs for insights by manually querying a database → Algorithms automatically surface insights to evaluate
•	Analysis takes from months to years → Automatic insights in seconds to minutes
•	Specialized skills in math and computer science required → Anyone (no specialized skills required)
•	Operational and business intelligence → Immediate insight and perspective

Emergence of Algorithms as a
New Class of Big Data Software Tools
The size and speed of Big Data demands true automation,
in which work is offloaded from human to machine. This
automation happens with algorithms, which are designed
for calculation, data processing, and automated reasoning.
Algorithms are designed for tasks that are beyond human
comprehension and require the speed of machines. This is
the realm of Big Data.
One of the most dramatic and game-changing examples of
an algorithm was designed by Alan Turing to automatically
decode German Navy messages at Bletchley Park during
WWII. In this instance, the urgency was critical and
demanded an automated approach to convert the data
to intelligence. There were 158 million million million
(158,000,000,000,000,000,000) possible ways that a
message could be coded by the German Enigma machine.
The effort succeeded: the German Navy’s “Shark” cipher traffic was broken with the help of Turing’s electromechanical Bombes, and the results changed the course of the war. The Allies won because they had a competitive advantage. Bringing it to the present,
the use of Big Data in the 2012 United States presidential
election changed the face of political campaigns forever.
Emcien’s approach to Big Data is to automate the process
of data-to-insight in a timely and cost effective manner
through sophisticated algorithms. The algorithms leverage
advanced mathematics to solve complex problems of
an unimaginable size, thereby pushing the frontier of
innovation and competition. The following section details
Emcien’s algorithmic approach to Big Data Analytics.

Emcien’s Algorithmic Approach to Big Data
Rather than search and crunch data, organizations need the ability to analyze, visualize and ultimately leverage
the patterns and connections within their data. Emcien’s innovation is a suite of automatic pattern detection
algorithms. These algorithms utilize a graph data model that captures the interconnectedness of the data
elements and creates a very elegant representation of high volume data with unknown structure. These fast,
sophisticated algorithms automatically detect patterns and self-organize what they find, thereby providing
immediate insight and perspective.

Here is an outline of how the algorithm works:
1.	Assesses the data in order to identify and measure connections between data points.
2.	Converts the original high-density/low-value (structured, semi-structured or unstructured) data into a low-density/high-value graph (see the sketch after this list).
If the data is structured:
•	 Each row in the data table is considered an event.
•	 Each cell in the row is converted to a node in the graph.
•	 Cells that co-occur in a row (event) are connected by an arc.
If the data is unstructured:
•	 Every word or data element is converted to a node.
•	 Two words or data elements that occur simultaneously in an event are connected by an arc.
•	 An event may be defined as a single document, message, email exchange, etc.
3.	Builds the graph on non-Euclidean distances. This is important, as most of the data is unstructured and non-numeric, so the distances and strengths are computed in non-Euclidean space. (For example, you may be closer to your family than to your friends, but that closeness is not a Euclidean distance.)
4.	Computes millions of data points across the graphs to enable patterns to emerge.
5.	Filters out the noise to allow the signal to emerge. This is possible because the noise has patterns, and the algorithms are designed to detect those patterns.
6.	Surfaces the key topographical elements of the graph, then ranks and focuses these elements.
7.	Categorizes these elements based on the application and outputs the insight.
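A minimal sketch of steps 1 and 2 follows, assuming simple co-occurrence counting. The names and data are illustrative; this is not Emcien’s implementation.

```python
# Sketch of steps 1-2 above: turn events into a weighted co-occurrence
# graph. Each row (structured) or document (unstructured) is an event;
# its elements become nodes, and elements that co-occur within an event
# are joined by an arc whose weight counts the co-occurrences.
# Illustrative only; this is not Emcien's implementation.
from collections import Counter
from itertools import combinations

def build_graph(events):
    nodes = set()
    arcs = Counter()  # (node_a, node_b) -> co-occurrence weight
    for event in events:
        elements = set(event)  # cells of a row, or words of a document
        nodes.update(elements)
        for a, b in combinations(sorted(elements), 2):
            arcs[(a, b)] += 1
    return nodes, arcs

# Structured example: each row is an event, each cell a node.
rows = [("atlanta", "fraud", "wire"), ("atlanta", "fraud", "check")]
nodes, arcs = build_graph(rows)
print(arcs[("atlanta", "fraud")])  # 2: the strongest arc in this toy graph
```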

Understanding the Make-up of Graphs Based on Connections
The graph data model is very flexible and displays a distinct topography based on the density of connections. A cross-sectional view of the graph data model will typically expose the following layers:

Layer 1, Very Noisy Connections: Typically the most highly prevalent data, this layer is composed of high-volume interactions that may be mundane and blatantly obvious.

Layer 2, Highly Connected Nodes: Lying just below the noise, this second layer is composed of the first signal that is interesting. This layer exhibits distinct patterns based on crowd behavior.

Layer 3, Weaker Connections: The third layer is a weaker signal and displays the non-obvious connections. These relate to events that are less frequent and may be connected in non-obvious ways.

Layer 4, The Faint Signal: Composed of very weak connections and interactions, this last layer is of interest for security and surveillance. In many cases, this layer only emerges when the data is very rich in entities, causing connections to emerge in very non-obvious ways.
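Read as code, this layering might look like the hedged sketch below. The weight thresholds are assumptions made for illustration, since the paper does not specify how the layers are delimited.

```python
# Illustrative bucketing of arcs into the four layers above by relative
# connection weight. The quartile thresholds are assumptions; the white
# paper does not define where one layer ends and the next begins.
def layer_of(weight, max_weight):
    ratio = weight / max_weight
    if ratio > 0.75:
        return 1  # very noisy connections
    if ratio > 0.50:
        return 2  # highly connected nodes
    if ratio > 0.25:
        return 3  # weaker connections
    return 4      # the faint signal

arcs = {("a", "b"): 900, ("a", "c"): 600, ("b", "d"): 300, ("c", "e"): 20}
top = max(arcs.values())
for arc, weight in sorted(arcs.items()):
    print(arc, "-> layer", layer_of(weight, top))
```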

Advantages of Emcien’s Graph Data Model
The graph data model exhibits a topography that signifies relationships and connectedness in a way that is not possible through any other method. Emcien’s algorithms have been designed to surface these patterns. Listed below are a few key attributes that help describe the characteristics of the algorithms.

Software: A critical distinction of Emcien’s graph data model is that it is software. Emcien’s software provides the computational engine with a data representation that lends itself to high-speed computing. As a result, the software runs on typical commodity computing environments.

Algorithmic Layer: Although some products on the market model the graph into the database layer or hardware layer, they do not have an algorithmic layer and therefore require the user to query the system based on the old data-inquiring paradigm. Algorithms automate the data analysis process, an absolute requirement for efficient Big Data analytics.

Noise Elimination: The graph data model can be thought of in terms of layers, based on the connectedness of the data elements. The highly connected and noisy nodes are at the top layer, and the weak connections lie buried deep in the graph. The noisy connections can be overwhelming and tend to render graph models burdensome. Emcien utilizes a suite of patented algorithms to automatically set aside the noise and detect critical patterns that relate to highly significant and relevant information.

Compact Representation of Data: The graph representation is ideal for Big Data because it creates a very compact representation of the data. Data is big because the number of events can grow exponentially as the various entities continually interact, but the number of entities grows more slowly and reaches a natural steady state. In the graph data model these interactions translate to connection weights, allowing the graph model to encapsulate very big data in smaller structures. A small demonstration follows.
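The steady-state behavior is easy to show with a toy simulation; the entities and events below are hypothetical and for illustration only.

```python
# Toy demonstration of the compactness claim: events grow without bound,
# but with a fixed population of entities the node and arc counts reach
# a natural steady state. Entities and events here are hypothetical.
import itertools
import random

random.seed(0)
entities = [f"user{i}" for i in range(50)]  # fixed population
nodes, arcs = set(), set()
total_events = 0
for batch in (100, 1_000, 10_000):
    for _ in range(batch):
        event = random.sample(entities, 3)
        nodes.update(event)
        arcs.update(itertools.combinations(sorted(event), 2))
    total_events += batch
    print(f"{total_events:>6} events -> {len(nodes)} nodes, {len(arcs)} arcs")
# The arc count is capped at C(50, 2) = 1,225 no matter how many events arrive.
```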

Not Just Theory: Solving Real-World Problems
The representation and visualization of complex networks as graphs helps surface critical, time-sensitive
intelligence. One of the most important tasks in graph analysis is to identify closely connected network
components comprising nodes that share similar properties. Detecting communities is of significant value
in retail, healthcare, banking and intelligence work: verticals where loosely federated communities deliver
insight and intelligence into the profile of a customer base or any other group being analyzed.

How Can This Model Be Applied? Emcien’s Pattern Detection Engine:
•	 Intelligence: Surfaces critical correlations
between people that merit serious attention,
determines key individuals in targeted social
networks, and geo-locates persons of interest
and their networks around the world – from
gangs to terrorists.
•	 Network Security: Auto-detects intrusion patterns
and surfaces suspicious activity by providing
immediate insight into highly linked variables. It
then automatically identifies anomalies without
the user having to query the data. For example,
Emcien analyzes millions to billions of transactions
to identify patterns in source and destination IP addresses, ports, days, times and activity, showing you what you should be paying attention to. Emcien eliminates over 95% of the noise and
identifies patterns that are “surprising” or that
deviate from the norm.

•	 Fraud Detection: Surfaces patterns in money
laundering and fraud by identifying groups of
customers, locations, or transaction types that
occur together in banking transactions.
•	 Customer Analytics: Surfaces insights on customer
buying patterns, locations, demographics, loyalty,
savings, lifestyle and insurance.
•	 Healthcare Analytics: Analyzes massive volumes
of clinical data on medications, allergies, medical
claims, pharmacy therapies, lab results, medical
records, clinician notes and more in order to
surface patterns.
•	 Performance and Operations Analytics: Analyzes
raw information about performance and operations
of every element of an organization which can be
interpreted to increase profitability or improve
customer service.

In short, Emcien tackles one of the biggest challenges with Big Data, namely “What are the right questions
to ask?” Emcien’s pattern-detection engine quickly discovers the value within massive data sets by making
connections between disparate, seemingly unrelated bits of information and by finding the highest-ranked of
these connections to focus on, which reveals time-sensitive, mission-critical insights.

Conclusion
The Big Data Analytics revolution is underway. This
revolution is a historic and game-changing expansion of
the role that information plays in business, government
and consumer realms. To harness the power of this data
revolution, a paradigm shift is required. Organizations
must be able to do more than query their Big Data
stores; search is no longer enough.
Up until now in the history of data analysis, the objective
of queries was to find the signal in the noise. And it
worked because we had clear-cut business questions
and the size of the data was smaller, the data set was
more complete, and we usually knew what we were
looking for. We were playing in the realm of known
knowns and known unknowns. In the new world of Big
Data, it is now more important to know what to ignore.
Because unless you know what to ignore, you’ll never
get a chance to pay attention to what’s really important.
Using algorithms to first ignore the noise and then find
the insights is the way of the new world.

Extracting insight from Big Data requires analytics
methods that are fundamentally different from
traditional querying, mining, and statistical analysis
on small samples. Big Data is often noisy, dynamic,
heterogeneous, unstructured, inter-related and
untrustworthy.
The combinatorial explosion requires new methods for finding insight in Big Data, and economics and time-criticality drive the need for this data sophistication.
As stated, manually laboring through thousands of
queries to find the ones that deliver insight is not an
efficient way to derive value from data.
Emcien’s technology provides a “Command Center”
for Big Data, automatically interpreting the data,
discovering patterns, identifying complex and significant
relationships, and surfacing the most relevant questions
that lead to the insights analysts need to know.

About Emcien Corp.
Emcien’s automatic pattern-detection engine converts data to actionable insight that organizations can use
immediately. Emcien breaks through time, cost and scale barriers that limit the ability to operationalize the
value of data for mission-critical applications. Our patented algorithms recognize what’s important, defocus
what’s not, evaluate all possible combinations and deliver the optimal results automatically. Emcien’s engine,
fueled by several highly competitive NSF grants and years of research at Georgia Tech and MIT, is delivering
unprecedented value to organizations across sectors that depend on immediate insight for success: banking, healthcare, insurance, retail, intelligence and others. Visit emcien.com to learn more.

Sources
1.	Christopher C. Shilakes and Julie Tylman, “Enterprise Information Portals,” Merrill Lynch, 16 November 1998.
2.	Michael Friendly, “Milestones in the History of Thematic Cartography, Statistical Graphics, and Data Visualization,” 2008.
3.	J. Vascellaro, “Hadoop Has Promise but Also Problems,” The Wall Street Journal, 23 February 2012.
4.	R. Srinivasan, “Enterprise Hadoop: Five Issues with Hadoop That Need Addressing,” blog, 28 May 2012.

Mais conteúdo relacionado

Mais procurados

Applications of Big Data Analytics in Businesses
Applications of Big Data Analytics in BusinessesApplications of Big Data Analytics in Businesses
Applications of Big Data Analytics in BusinessesT.S. Lim
 
A Survey on Data Mining
A Survey on Data MiningA Survey on Data Mining
A Survey on Data MiningIOSR Journals
 
Introduction to Big Data & Analytics
Introduction to Big Data & AnalyticsIntroduction to Big Data & Analytics
Introduction to Big Data & AnalyticsPrasad Chitta
 
elgendy2014.pdf
elgendy2014.pdfelgendy2014.pdf
elgendy2014.pdfAkuhuruf
 
Big Data Use-Cases across industries (Georg Polzer, Teralytics)
Big Data Use-Cases across industries (Georg Polzer, Teralytics)Big Data Use-Cases across industries (Georg Polzer, Teralytics)
Big Data Use-Cases across industries (Georg Polzer, Teralytics)Swiss Big Data User Group
 
Snowball Group Whitepaper - Spotlight on Big Data
Snowball Group Whitepaper - Spotlight on Big DataSnowball Group Whitepaper - Spotlight on Big Data
Snowball Group Whitepaper - Spotlight on Big DataSnowball Group
 
Challenges in Analytics for BIG Data
Challenges in Analytics for BIG DataChallenges in Analytics for BIG Data
Challenges in Analytics for BIG DataPrasant Misra
 
Paradigm4 Research Report: Leaving Data on the table
Paradigm4 Research Report: Leaving Data on the tableParadigm4 Research Report: Leaving Data on the table
Paradigm4 Research Report: Leaving Data on the tableParadigm4
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data AnalyticsUtkarsh Sharma
 
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...Thomas Rones
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data AnalyticsUtkarsh Sharma
 
A Model Design of Big Data Processing using HACE Theorem
A Model Design of Big Data Processing using HACE TheoremA Model Design of Big Data Processing using HACE Theorem
A Model Design of Big Data Processing using HACE TheoremAnthonyOtuonye
 

Mais procurados (20)

Big data analytics
Big data analyticsBig data analytics
Big data analytics
 
Data analytics
Data analyticsData analytics
Data analytics
 
Big Data and Analytics - 2016 CFO
Big Data and Analytics - 2016 CFOBig Data and Analytics - 2016 CFO
Big Data and Analytics - 2016 CFO
 
Applications of Big Data Analytics in Businesses
Applications of Big Data Analytics in BusinessesApplications of Big Data Analytics in Businesses
Applications of Big Data Analytics in Businesses
 
Reports vs analysis
Reports vs analysisReports vs analysis
Reports vs analysis
 
A Survey on Data Mining
A Survey on Data MiningA Survey on Data Mining
A Survey on Data Mining
 
Big Data Challenges faced by Organizations
Big Data Challenges faced by OrganizationsBig Data Challenges faced by Organizations
Big Data Challenges faced by Organizations
 
Introduction to Big Data & Analytics
Introduction to Big Data & AnalyticsIntroduction to Big Data & Analytics
Introduction to Big Data & Analytics
 
elgendy2014.pdf
elgendy2014.pdfelgendy2014.pdf
elgendy2014.pdf
 
Big Data Use-Cases across industries (Georg Polzer, Teralytics)
Big Data Use-Cases across industries (Georg Polzer, Teralytics)Big Data Use-Cases across industries (Georg Polzer, Teralytics)
Big Data Use-Cases across industries (Georg Polzer, Teralytics)
 
The 25 Predictions About The Future Of Big Data
The 25 Predictions About The Future Of Big DataThe 25 Predictions About The Future Of Big Data
The 25 Predictions About The Future Of Big Data
 
Snowball Group Whitepaper - Spotlight on Big Data
Snowball Group Whitepaper - Spotlight on Big DataSnowball Group Whitepaper - Spotlight on Big Data
Snowball Group Whitepaper - Spotlight on Big Data
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big data unit i
Big data unit iBig data unit i
Big data unit i
 
Challenges in Analytics for BIG Data
Challenges in Analytics for BIG DataChallenges in Analytics for BIG Data
Challenges in Analytics for BIG Data
 
Paradigm4 Research Report: Leaving Data on the table
Paradigm4 Research Report: Leaving Data on the tableParadigm4 Research Report: Leaving Data on the table
Paradigm4 Research Report: Leaving Data on the table
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What...
 
Introduction to Big Data Analytics
Introduction to Big Data AnalyticsIntroduction to Big Data Analytics
Introduction to Big Data Analytics
 
A Model Design of Big Data Processing using HACE Theorem
A Model Design of Big Data Processing using HACE TheoremA Model Design of Big Data Processing using HACE Theorem
A Model Design of Big Data Processing using HACE Theorem
 

Semelhante a Emcien overview v6 01282013

An introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt ThearlingAn introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt ThearlingPim Piepers
 
GROUP PROJECT REPORT_FY6055_FX7378
GROUP PROJECT REPORT_FY6055_FX7378GROUP PROJECT REPORT_FY6055_FX7378
GROUP PROJECT REPORT_FY6055_FX7378Parag Kapile
 
An introduction to Data Mining
An introduction to Data MiningAn introduction to Data Mining
An introduction to Data MiningShobhita Dayal
 
Guide to big data analytics
Guide to big data analyticsGuide to big data analytics
Guide to big data analyticsGahya Pandian
 
Unit 1 Introduction to Data Analytics .pptx
Unit 1 Introduction to Data Analytics .pptxUnit 1 Introduction to Data Analytics .pptx
Unit 1 Introduction to Data Analytics .pptxvipulkondekar
 
Data minig with Big data analysis
Data minig with Big data analysisData minig with Big data analysis
Data minig with Big data analysisPoonam Kshirsagar
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big dataHari Priya
 
QuickView #3 - Big Data
QuickView #3 - Big DataQuickView #3 - Big Data
QuickView #3 - Big DataSonovate
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfDr. Radhey Shyam
 
The Power of Data: Understanding Supply Chain Analytics
The Power of Data: Understanding Supply Chain AnalyticsThe Power of Data: Understanding Supply Chain Analytics
The Power of Data: Understanding Supply Chain AnalyticsXeneta
 
My latest white paper
My latest white paperMy latest white paper
My latest white paperJason Rushin
 
Bda assignment can also be used for BDA notes and concept understanding.
Bda assignment can also be used for BDA notes and concept understanding.Bda assignment can also be used for BDA notes and concept understanding.
Bda assignment can also be used for BDA notes and concept understanding.Aditya205306
 
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...Editor IJCATR
 
Mining Big Data using Genetic Algorithm
Mining Big Data using Genetic AlgorithmMining Big Data using Genetic Algorithm
Mining Big Data using Genetic AlgorithmIRJET Journal
 

Semelhante a Emcien overview v6 01282013 (20)

An introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt ThearlingAn introduction to Data Mining by Kurt Thearling
An introduction to Data Mining by Kurt Thearling
 
365 Data Science
365 Data Science365 Data Science
365 Data Science
 
GROUP PROJECT REPORT_FY6055_FX7378
GROUP PROJECT REPORT_FY6055_FX7378GROUP PROJECT REPORT_FY6055_FX7378
GROUP PROJECT REPORT_FY6055_FX7378
 
An introduction to Data Mining
An introduction to Data MiningAn introduction to Data Mining
An introduction to Data Mining
 
Big data upload
Big data uploadBig data upload
Big data upload
 
Guide to big data analytics
Guide to big data analyticsGuide to big data analytics
Guide to big data analytics
 
An introduction to data mining
An introduction to data miningAn introduction to data mining
An introduction to data mining
 
Unit 1 Introduction to Data Analytics .pptx
Unit 1 Introduction to Data Analytics .pptxUnit 1 Introduction to Data Analytics .pptx
Unit 1 Introduction to Data Analytics .pptx
 
Data mining
Data miningData mining
Data mining
 
Data mining
Data miningData mining
Data mining
 
Data minig with Big data analysis
Data minig with Big data analysisData minig with Big data analysis
Data minig with Big data analysis
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big data
 
QuickView #3 - Big Data
QuickView #3 - Big DataQuickView #3 - Big Data
QuickView #3 - Big Data
 
KIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdfKIT-601 Lecture Notes-UNIT-1.pdf
KIT-601 Lecture Notes-UNIT-1.pdf
 
The Power of Data: Understanding Supply Chain Analytics
The Power of Data: Understanding Supply Chain AnalyticsThe Power of Data: Understanding Supply Chain Analytics
The Power of Data: Understanding Supply Chain Analytics
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
My latest white paper
My latest white paperMy latest white paper
My latest white paper
 
Bda assignment can also be used for BDA notes and concept understanding.
Bda assignment can also be used for BDA notes and concept understanding.Bda assignment can also be used for BDA notes and concept understanding.
Bda assignment can also be used for BDA notes and concept understanding.
 
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
A Comparative Study of Various Data Mining Techniques: Statistics, Decision T...
 
Mining Big Data using Genetic Algorithm
Mining Big Data using Genetic AlgorithmMining Big Data using Genetic Algorithm
Mining Big Data using Genetic Algorithm
 

Último

Night 7k Call Girls Pari Chowk Escorts Call Me: 8448380779
Night 7k Call Girls Pari Chowk Escorts Call Me: 8448380779Night 7k Call Girls Pari Chowk Escorts Call Me: 8448380779
Night 7k Call Girls Pari Chowk Escorts Call Me: 8448380779Delhi Call girls
 
BDSM⚡Call Girls in Sector 76 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 76 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 76 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 76 Noida Escorts >༒8448380779 Escort ServiceDelhi Call girls
 
SELECTING A SOCIAL MEDIA MARKETING COMPANY
SELECTING A SOCIAL MEDIA MARKETING COMPANYSELECTING A SOCIAL MEDIA MARKETING COMPANY
SELECTING A SOCIAL MEDIA MARKETING COMPANYdizinfo
 
Call Girls In Gurgaon Dlf pHACE 2 Women Delhi ncr
Call Girls In Gurgaon Dlf pHACE 2 Women Delhi ncrCall Girls In Gurgaon Dlf pHACE 2 Women Delhi ncr
Call Girls In Gurgaon Dlf pHACE 2 Women Delhi ncrSapana Sha
 
VIP Chandigarh Call Girls Service 7001035870 Enjoy Call Girls With Our Escorts
VIP Chandigarh Call Girls Service 7001035870 Enjoy Call Girls With Our EscortsVIP Chandigarh Call Girls Service 7001035870 Enjoy Call Girls With Our Escorts
VIP Chandigarh Call Girls Service 7001035870 Enjoy Call Girls With Our Escortssonatiwari757
 
Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...
Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...
Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...Delhi Call girls
 
Night 7k Call Girls Atta Market Escorts Call Me: 8448380779
Night 7k Call Girls Atta Market Escorts Call Me: 8448380779Night 7k Call Girls Atta Market Escorts Call Me: 8448380779
Night 7k Call Girls Atta Market Escorts Call Me: 8448380779Delhi Call girls
 
Film the city investagation powerpoint :)
Film the city investagation powerpoint :)Film the city investagation powerpoint :)
Film the city investagation powerpoint :)AshtonCains
 
Factors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptx
Factors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptxFactors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptx
Factors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptxvemusae
 
Ready to get noticed? Partner with Sociocosmos
Ready to get noticed? Partner with SociocosmosReady to get noticed? Partner with Sociocosmos
Ready to get noticed? Partner with SociocosmosSocioCosmos
 
Top Call Girls In Charbagh ( Lucknow ) 🔝 8923113531 🔝 Cash Payment
Top Call Girls In Charbagh ( Lucknow  ) 🔝 8923113531 🔝  Cash PaymentTop Call Girls In Charbagh ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment
Top Call Girls In Charbagh ( Lucknow ) 🔝 8923113531 🔝 Cash Paymentanilsa9823
 
DickinsonSlides teeeeeeeeeeessssssssssst.pptx
DickinsonSlides teeeeeeeeeeessssssssssst.pptxDickinsonSlides teeeeeeeeeeessssssssssst.pptx
DickinsonSlides teeeeeeeeeeessssssssssst.pptxednyonat
 
Film show post-production powerpoint for site
Film show post-production powerpoint for siteFilm show post-production powerpoint for site
Film show post-production powerpoint for siteAshtonCains
 
Film show investigation powerpoint for the site
Film show investigation powerpoint for the siteFilm show investigation powerpoint for the site
Film show investigation powerpoint for the siteAshtonCains
 
Film show production powerpoint for site
Film show production powerpoint for siteFilm show production powerpoint for site
Film show production powerpoint for siteAshtonCains
 
Film show evaluation powerpoint for site
Film show evaluation powerpoint for siteFilm show evaluation powerpoint for site
Film show evaluation powerpoint for siteAshtonCains
 
Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779
Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779
Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779Delhi Call girls
 
Website research Powerpoint for Bauer magazine
Website research Powerpoint for Bauer magazineWebsite research Powerpoint for Bauer magazine
Website research Powerpoint for Bauer magazinesamuelcoulson30
 
Night 7k Call Girls Noida Sector 120 Call Me: 8448380779
Night 7k Call Girls Noida Sector 120 Call Me: 8448380779Night 7k Call Girls Noida Sector 120 Call Me: 8448380779
Night 7k Call Girls Noida Sector 120 Call Me: 8448380779Delhi Call girls
 

Último (20)

Night 7k Call Girls Pari Chowk Escorts Call Me: 8448380779
Night 7k Call Girls Pari Chowk Escorts Call Me: 8448380779Night 7k Call Girls Pari Chowk Escorts Call Me: 8448380779
Night 7k Call Girls Pari Chowk Escorts Call Me: 8448380779
 
BDSM⚡Call Girls in Sector 76 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 76 Noida Escorts >༒8448380779 Escort ServiceBDSM⚡Call Girls in Sector 76 Noida Escorts >༒8448380779 Escort Service
BDSM⚡Call Girls in Sector 76 Noida Escorts >༒8448380779 Escort Service
 
SELECTING A SOCIAL MEDIA MARKETING COMPANY
SELECTING A SOCIAL MEDIA MARKETING COMPANYSELECTING A SOCIAL MEDIA MARKETING COMPANY
SELECTING A SOCIAL MEDIA MARKETING COMPANY
 
Call Girls In Gurgaon Dlf pHACE 2 Women Delhi ncr
Call Girls In Gurgaon Dlf pHACE 2 Women Delhi ncrCall Girls In Gurgaon Dlf pHACE 2 Women Delhi ncr
Call Girls In Gurgaon Dlf pHACE 2 Women Delhi ncr
 
VIP Chandigarh Call Girls Service 7001035870 Enjoy Call Girls With Our Escorts
VIP Chandigarh Call Girls Service 7001035870 Enjoy Call Girls With Our EscortsVIP Chandigarh Call Girls Service 7001035870 Enjoy Call Girls With Our Escorts
VIP Chandigarh Call Girls Service 7001035870 Enjoy Call Girls With Our Escorts
 
Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...
Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...
Hire↠Young Call Girls in Hari Nagar (Delhi) ☎️ 9205541914 ☎️ Independent Esco...
 
Night 7k Call Girls Atta Market Escorts Call Me: 8448380779
Night 7k Call Girls Atta Market Escorts Call Me: 8448380779Night 7k Call Girls Atta Market Escorts Call Me: 8448380779
Night 7k Call Girls Atta Market Escorts Call Me: 8448380779
 
Film the city investagation powerpoint :)
Film the city investagation powerpoint :)Film the city investagation powerpoint :)
Film the city investagation powerpoint :)
 
Factors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptx
Factors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptxFactors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptx
Factors-on-Authenticity-and-Validity-of-Evidences-and-Information.pptx
 
Ready to get noticed? Partner with Sociocosmos
Ready to get noticed? Partner with SociocosmosReady to get noticed? Partner with Sociocosmos
Ready to get noticed? Partner with Sociocosmos
 
Top Call Girls In Charbagh ( Lucknow ) 🔝 8923113531 🔝 Cash Payment
Top Call Girls In Charbagh ( Lucknow  ) 🔝 8923113531 🔝  Cash PaymentTop Call Girls In Charbagh ( Lucknow  ) 🔝 8923113531 🔝  Cash Payment
Top Call Girls In Charbagh ( Lucknow ) 🔝 8923113531 🔝 Cash Payment
 
DickinsonSlides teeeeeeeeeeessssssssssst.pptx
DickinsonSlides teeeeeeeeeeessssssssssst.pptxDickinsonSlides teeeeeeeeeeessssssssssst.pptx
DickinsonSlides teeeeeeeeeeessssssssssst.pptx
 
Film show post-production powerpoint for site
Film show post-production powerpoint for siteFilm show post-production powerpoint for site
Film show post-production powerpoint for site
 
Film show investigation powerpoint for the site
Film show investigation powerpoint for the siteFilm show investigation powerpoint for the site
Film show investigation powerpoint for the site
 
9953056974 Young Call Girls In Kirti Nagar Indian Quality Escort service
9953056974 Young Call Girls In  Kirti Nagar Indian Quality Escort service9953056974 Young Call Girls In  Kirti Nagar Indian Quality Escort service
9953056974 Young Call Girls In Kirti Nagar Indian Quality Escort service
 
Film show production powerpoint for site
Film show production powerpoint for siteFilm show production powerpoint for site
Film show production powerpoint for site
 
Film show evaluation powerpoint for site
Film show evaluation powerpoint for siteFilm show evaluation powerpoint for site
Film show evaluation powerpoint for site
 
Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779
Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779
Night 7k Call Girls Noida New Ashok Nagar Escorts Call Me: 8448380779
 
Website research Powerpoint for Bauer magazine
Website research Powerpoint for Bauer magazineWebsite research Powerpoint for Bauer magazine
Website research Powerpoint for Bauer magazine
 
Night 7k Call Girls Noida Sector 120 Call Me: 8448380779
Night 7k Call Girls Noida Sector 120 Call Me: 8448380779Night 7k Call Girls Noida Sector 120 Call Me: 8448380779
Night 7k Call Girls Noida Sector 120 Call Me: 8448380779
 

Emcien overview v6 01282013

  • 1. The Big Data Paradigm Shift: Insight Through Automation
  • 2. In this white paper, you will learn about: • Big Data’s combinatorial explosion • Current and emerging technologies • Automation as the new way to leverage insight within Big Data • An algorithmic approach to the Big Data revolution 2 w w w. e m c i e n . c o m
  • 3. CONTENTS 01 / EXECUTIVE SUMMARY page 5 02 / THE BIG DATA PARADIGM SHIFT page 6 03 / LANDSCAPE OF EXISTING METHODS AND TOOLS page 8 04 / EMERGENCE OF BIG DATA TOOLS page 10 05 / THE NEED FOR A NEW BIG DATA ANALYTICS APPROACH page 11 06 / EMCIEN’S ALGORITHMIC APPROACH TO BIG DATA page 13 07 / NOT JUST THEORY: SOLVING REAL-WORLD PROBLEMS page 15 08 / CONCLUSION page 16 w w w. e m c i e n . c o m 3
  • 4. ... a study of progress over a 15-year span on a benchmark production-planning task. Over that time, the speed of completing the calculations improved by a factor of 43 million. Of the total, a factor of roughly 1,000 was attributable to faster processor speeds. Yet a factor of 43,000 was due to improvements in the efficiency of software algorithms.” Martin Grotschel, a German scientist and mathematician White House Advisory Report DECEMBER 2010 4 w w w. e m c i e n . c o m
  • 5. Data used to be scarce, and tiny bits of it were extremely valuable. Tiny bits of it are still extremely valuable, but now there is so much data that finding the valuable bits can be extremely difficult.” Executive Summary Big Data promises greater insight, competitive advantage, and the possibility of solving problems that have yet to be imagined. These insights will come from software applications that automate the analytics process. While infrastructure may be an important building block, this paper focuses on the algorithms that will deliver the insights organizations need. This paper does the following: • Proposes a paradigm shift away from analysts’ one-to-one relationship with data toward a relationship with algorithms. • Explains the combinatorial explosion that makes Big Data Analytics impossible for old data analytics tools and methodologies. • Examines current and emerging technologies. • Suggests that methods which accommodate Big Data users in the same way that smaller data sets are handled are missing the potential of Big Data. • Proposes that rather than search and crunch data, organizations need the ability to automate the process of analyzing, visualizing and ultimately leveraging the insight within their data. • Introduces an algorithmic approach that provides an efficient, sustainable, automated way to delve into Big Data, detect patterns, and discover insights hidden within that data. w w w. e m c i e n . c o m 5
  • 6. The Big Data Paradigm Shift The Need for a Paradigm Shift Humankind has always possessed a love for data. We can do remarkable things with data and have built remarkable tools to collect, store, sift, sort, splice, dice, chart, report, predict, and visualize it. Data can change the way we perceive the world and how we interact with it. But the world is changing. Data used to be scarce, and tiny bits of it were extremely valuable. Tiny bits of it are still extremely valuable, but now there is so much data that finding the valuable bits can be extremely difficult. Big Data demands that organizations change the way they interact with data. In the past, analysts could stand by the data faucet and collect what was needed in a paper cup, but now data is the ocean in which they are floating. Paper cups are useless here. Why? Because the search for data is over. Data is everywhere. It creates noise. Now analysts are searching for the signal amidst the noise. They are looking for the important bits. And in this vast sea of information, that task is overwhelming. Current tools and methodologies are failing when it comes to finding the most critical information in a time-sensitive, cost-effective manner. The reason? Many of these emerging tools and technologies are trying to approach this challenge with new ways of doing the same old things. But a bigger cup is not what’s needed. That won’t solve it. What is needed is a completely new approach to Big Data Analytics. In this new approach, the only way to find the signal is to automate the process of data-to-insight conversion. And automation requires algorithms— fast, sophisticated, highly optimized algorithms. The volume of Big Data demands a change in the human relationship with data, from human to machine. The algorithms have to do the work, not the humans. In this brave new world, the machines and algorithms are the protagonists. The role of the analysts will be to select the best algorithms and approve the quality of results based on speed, quality and economics. 6 The Promise of Data and the Search for Insight Why is the world obsessed with data? Because the promise of data is insight. In the last few years, organizations have become exceptionally good at collecting data, and as the cost of storage has dropped, companies are now drowning in that “Big Data.” However, the business world has hit a wall where the amount of data available far exceeds the human capacity to process it. The amount of data also exceeds the capabilities of existing analytics and intelligence tools, which have served as mere data-shovels or pick axes in the search for the gold that is insight. Acquiring insight inevitably involves querying a database, or, most likely, several databases. Many analysts are mashing up data across data silos in an attempt to discover the connections between data points. For example, marketers are aggregating customer demographics, purchase data and social media data, while purchasers are aggregating supplier data with procurement and pricing data. This process produces a variety of data sets of different types and qualities. A simple query, for example, might ask for specific values within a subset of columns. So the real question becomes, “How many queries will it take to answer even one of these questions?” Consider how many queries one might make into even a small set of data, such as a table containing just ten columns: • If each column has 2 possible values, there are 59,048 possible queries. 
• If each column has 3 possible values, there are 1,048,575 possible queries. To think of it another way, a database with 100 columns and 6 choices per column yields more possible queries than there are atoms in the universe. w w w. e m c i e n . c o m
  • 7. The Limitations of Search and Query-based Approaches Data becomes unwieldy due to the number of rows or the number of columns, or both. Data mash-ups, described previously, create a lot of columns. High volume transactional systems have lots of rows or records. However, having millions of records is not the problem. The depth of the data—the number of rows—merely impacts processing time in a linear fashion and can be reduced with fast or parallel computing. Thus, executing a query is simple enough. The problem, however, is the width of the data because the query explodes exponentially based on the number of columns. Number of Queries As a result, the real task of extracting insight from data is formulating the right queries. And manually laboring through thousands of queries to find the ones that deliver insight is not an efficient way to derive value from data. Therefore, when it comes to Big Data, the big challenge is knowing the right query. 30 Billion Exponential Explosion of Queries 10 Billion 1 Billion 1 Million 0 2 4 6 8 10 12 14 Number of Variables per Column w w w. e m c i e n . c o m 7
  • 8. Landscape of Existing Methods and Tools Over 85% of all data is unstructured. 1 However, existing methods and tools are designed to analyze structured data. A high level categorization of analytics tools is critical to understanding the state of Big Data Analytics. Statistical Tool Kits The purpose of statistical analysis is to make inferences from samples of data, especially when data is scarce. In the era of Big Data, scarcity is not the problem. Traditional statistical methods have severe limitations in the realm of Big Data for the following reasons: • Statistical methods break down as dimensionality increases. • In unstructured data, dimensions are not well defined. • Attempts to define dimension for unstructured data result in millions of dimensions. Data Mining Data mining is a catchall phrase for a very broad category of methods. Essentially, it is a method for sifting through very large amounts of data in attempt to find useful information. It implies “digging through tons of data” to uncover patterns and relationships contained within the business activity and history. Data mining involves manually slicing and dicing the data until a pattern becomes obvious or by using software that analyzes the data automatically. The first limitation of data mining is that the data has to be put in a structured format first, such as a database. The second limitation is that most forms of data mining require that the analyst knows what to look for. For example, in classification and clustering analysis, the analyst is trying to find instances of known categories, such as people who have a high probability of defaulting on their mortgages. In anomaly detection, the analyst is looking for instances that do not match the known normal patterns or known suspicious patterns, such as people who pay cash for one-way plane tickets. 8 The overwhelming shortcoming of all these methods is that they are query-based and labor intensive.” Data Visualization Data visualization is the study of the visual representation of data, meaning information that has been abstracted in some schematic form, including attributes or variables for the units of information.2 Humans are better equipped to consume visual data than text. As we know, a picture is worth a thousand words. While visualization tools are interesting, they rely on human evaluation to extract insight and knowledge. The more severe limitation of visualization is that the visuals can only focus on two or three dimensions at the most before the amount of information is overwhelming. The most common limitation of visualization is that while it is a good test for small samples, it is not a sustainable method to gain insight into large volumes of higher dimensionality data. Consider a scenario in which there aren’t enough pixels on the screen to represent each item. An analyst can easily inspect a friendship network with 10-100 people, but not on billion. Business Intelligence & Analytics Business Intelligence (BI) is a catchall phrase for ad hoc reports created in a database. These are typically pre-canned reports based on metrics that users are comfortable reporting. “Analytics” includes any computation performed for reporting. Hence, BI tools are now called analytics. BI was created as a way to extract data from the database. While it continues to serve that purpose, it is time and labor intensive and is not intended to surface insights. w w w. e m c i e n . c o m
Limitations of Existing Tools

The overwhelming shortcoming of all these methods is that they are query-based and labor-intensive. Big Data offers a virtually infinite number of possible queries, so all of these methods rely on analysts to produce the questions. Any method that puts that burden on the user is a show-stopper. Although search remains the go-to information-access interface, the reliance on search needs to end. Search is not enough; a new type of information-processing focus is needed. The major shortcomings of the existing tools are as follows:
•	Search helps you find only what you already know to look for; it doesn't help you discover things about which you're completely unaware.
•	Query-based tools are time-consuming because search-based approaches require a virtually infinite number of queries.
•	Statistical methods are largely limited to numerical data, yet over 85% of data is unstructured.
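To see how these methods put the burden on the analyst, return to the anomaly-detection example from the previous section (cash payments for one-way tickets). In the query-based paradigm, someone must already know, and hand-write, every such rule before it can be found. The sketch below is a hypothetical illustration, not taken from any product; the field names and records are invented.

```python
# Hypothetical illustration of the query-based paradigm: every suspicious
# pattern must be anticipated and encoded by hand before it can be found.

transactions = [
    {"passenger": "A", "payment": "cash", "trip": "one-way", "bags": 0},
    {"passenger": "B", "payment": "card", "trip": "round-trip", "bags": 2},
    {"passenger": "C", "payment": "cash", "trip": "one-way", "bags": 1},
]

def known_suspicious(t: dict) -> bool:
    """One hand-written rule. Patterns nobody thought to encode stay invisible."""
    return t["payment"] == "cash" and t["trip"] == "one-way"

flagged = [t for t in transactions if known_suspicious(t)]
print(flagged)  # finds only what the analyst already knew to ask about
```

The rule works, but only because the question was known in advance; combinations no one thought to encode never surface.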
Emergence of Big Data Tools

Because Big Data includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time, new technologies are emerging to address the challenges brought on by these large quantities of data. These technologies can be categorized into two groups: Hadoop-based solutions and in-memory solutions.

Hadoop and Hadoop-based Tools
While Hadoop is not an analytics tool per se, it is often mistaken for one. Apache Hadoop is an open-source software framework that supports data-intensive distributed applications running on large clusters of commodity hardware. Hadoop breaks big tasks into smaller ones so they can be run in parallel for speed and efficiency. This works well for a query on a large-volume data set: the data can be cut into smaller pieces, and the same query can be run on each piece, as the sketch after this section shows. Hadoop aims to lower costs by storing data in chunks across many inexpensive servers and storage systems, and it can speed up certain types of simple calculations by sending many queries to multiple machines at the same time. The technology has spawned a set of new start-ups, such as Hortonworks Inc. and Cloudera Inc., which help companies implement it.

Hadoop helps companies store large amounts of data but doesn't provide critical insights based on the naturally occurring connections within the data. The impulse to store lots of data because it is cheap to do so can lead to storing too much data, which can make answering simple questions more difficult. IT professionals and analysts are asking the following questions:
•	Where is the insight? What is the data telling us?
•	How can we prove the return on investment?
As a result, in-memory databases are gaining attention in an attempt to come closer to the goal of real-time business processing.3,4

In-Memory Appliances
Some of these approaches have been around for a long time in areas such as telecommunications or fields related to embedded databases. An example is SAP's HANA (High-Performance Analytic Appliance). The in-memory paradigm is now touted as the future database paradigm for Big Data. Its primary limitations are cost and size: there is a significant limit to the amount of data that can be held in memory. If you need to perform Big Data-style analysis and want to see the bigger picture, in-memory is not enough. The cost is prohibitive, and the approach is not sustainable.
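The divide-and-conquer pattern that Hadoop popularized can be sketched in a few lines. The example below is a simplified stand-in using only Python's standard library, not Hadoop itself: it splits a data set into chunks, runs the same "query" (a count) on each chunk in parallel, and combines the partial results. The records and the error-counting query are invented for illustration.

```python
# Simplified stand-in for Hadoop's split / parallel-query / combine pattern,
# using only the standard library (not Hadoop itself).
from multiprocessing import Pool

def count_errors(chunk):
    """The same simple 'query' is run independently on every chunk."""
    return sum(1 for record in chunk if record.get("status") == "error")

if __name__ == "__main__":
    records = [{"status": "error" if i % 7 == 0 else "ok"} for i in range(1_000_000)]
    n = 4  # pretend each chunk lives on a different commodity machine
    chunks = [records[i::n] for i in range(n)]

    with Pool(n) as pool:
        partial_counts = pool.map(count_errors, chunks)  # the "map" phase

    print(sum(partial_counts))  # the "reduce" phase: combine partial results
```

Note what the pattern does and does not buy you: the known query runs faster, but nothing in the machinery suggests which query was worth running, which is exactly the gap the rest of this paper addresses.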
The Need for a New Big Data Analytics Approach

While these emerging technologies are attempting to address the challenge of Big Data, at the end of the day they are heavy-handed and time-consuming because they lack automated intelligence for gaining insight. It's time for an entirely new approach. This new approach demands a paradigm shift that focuses on the following:
•	A fundamental change in the role played by analysts, from data-miners to insight-evaluators.
•	Fast and efficient algorithms that automatically convert data to insight for evaluation.
•	Continual improvement of these algorithms to keep up with the speed of data and the critical need for timely insights.

Old Paradigm → New Paradigm
•	Data analyst digs for insights by manually querying a database → Algorithms automatically surface insights to evaluate
•	Analysis takes from months to years → Automatic insights in seconds to minutes
•	Specialized skills in math and computer science required → Anyone (no specialized skills required)
•	Operational and business intelligence → Immediate insight and perspective
Emergence of Algorithms as a New Class of Big Data Software Tools

The size and speed of Big Data demand true automation, in which work is offloaded from human to machine. This automation happens with algorithms, which are designed for calculation, data processing, and automated reasoning. Algorithms are built for tasks that are beyond human comprehension and require the speed of machines. This is the realm of Big Data.

One of the most dramatic and game-changing examples of an algorithmic approach was designed by Alan Turing at Bletchley Park during WWII to automatically decode German Navy messages. The urgency was critical and demanded an automated approach to convert the data to intelligence: there were 158 million million million (158,000,000,000,000,000,000) possible ways a message could be encoded by the German Enigma machine. Turing's electromechanical Bombe cracked the traffic, including the naval Enigma cipher the Allies code-named "Shark," and the results changed the course of the war. The Allies won because they had a competitive advantage.

Bringing it to the present, the use of Big Data in the 2012 United States presidential election changed the face of political campaigns forever. Emcien's approach to Big Data is to automate the process of data-to-insight in a timely and cost-effective manner through sophisticated algorithms. The algorithms leverage advanced mathematics to solve complex problems of unimaginable size, thereby pushing the frontier of innovation and competition. The following section details Emcien's algorithmic approach to Big Data Analytics.
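The 158 million million million figure can be reproduced with the standard back-of-the-envelope Enigma calculation: rotor choice, rotor starting positions, and plugboard pairings multiply together. The sketch below reconstructs that arithmetic; it is an illustration added here, not part of the original paper.

```python
# Reconstructing the oft-quoted Enigma keyspace: roughly 1.59 x 10^20 settings.
from math import factorial

rotor_orders = 5 * 4 * 3       # choose and order 3 rotors from a set of 5
rotor_positions = 26 ** 3      # each rotor starts at one of 26 letters
# Plugboard: 10 cables pairing up 20 of the 26 letters.
plugboard = factorial(26) // (factorial(6) * factorial(10) * 2 ** 10)

total = rotor_orders * rotor_positions * plugboard
print(f"{total:,}")  # 158,962,555,217,826,360,000: "158 million million million"
```

No human, or team of humans, could enumerate that space; only an automated, algorithmic attack made the intelligence timely enough to matter.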
Emcien's Algorithmic Approach to Big Data

Rather than search and crunch data, organizations need the ability to analyze, visualize, and ultimately leverage the patterns and connections within their data. Emcien's innovation is a suite of automatic pattern-detection algorithms. These algorithms use a graph data model that captures the interconnectedness of the data elements and creates a very elegant representation of high-volume data with unknown structure. These fast, sophisticated algorithms automatically detect patterns and self-organize what they find, thereby providing immediate insight and perspective. Here is an outline of how the algorithm works:

1. Assesses the data in order to identify and measure connections between data points.
2. Converts the original high-density/low-value data (structured, semi-structured, or unstructured) into a low-density/high-value graph.
   If the data is structured:
   •	Each row in the data table is considered an event.
   •	Each cell in the row is converted to a node in the graph.
   •	Cells that co-occur in a row (event) are connected by an arc.
   If the data is unstructured:
   •	Every word or data element is converted to a node.
   •	Two words or data elements that co-occur in an event are connected by an arc.
   •	An event may be defined as a single document, message, email exchange, etc.
3. Builds the graph on non-Euclidean distances. This is important, as most of the data is unstructured and non-numeric. The distances and strengths are computed in non-Euclidean space. (For example, you may be "closer" to your family than to your friends, but that closeness is not a Euclidean distance!)
4. Computes millions of data points across the graphs to enable patterns to emerge.
5. Distills out the noise to allow the signal to emerge. This is possible because the noise has patterns of its own, and the algorithms are designed to detect them.
6. Enables the key topographical elements of the graph to emerge. The algorithm then ranks and focuses on these elements.
7. Categorizes these elements based on the application and outputs the insight.

A minimal sketch of step 2, the graph construction, follows this list.
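The sketch below illustrates step 2 for structured rows: each cell becomes a node, and cells that co-occur in a row are joined by a weighted arc. It uses only the standard library and invented sample data, and it is a simplified reading of the description above, not Emcien's patented implementation; in particular, qualifying each value by its column is an assumption made here to keep nodes unambiguous.

```python
# Simplified sketch of step 2: rows become events, cells become nodes,
# and co-occurrence within a row becomes a weighted arc.
from collections import Counter
from itertools import combinations

rows = [  # invented sample data: (product, city, channel)
    ("umbrella", "Seattle", "web"),
    ("umbrella", "Seattle", "store"),
    ("sunscreen", "Phoenix", "web"),
]

nodes = set()
arc_weights = Counter()  # arc weight = number of events the pair co-occurs in

for row in rows:
    cells = [f"col{i}={v}" for i, v in enumerate(row)]  # qualify values by column
    nodes.update(cells)
    for a, b in combinations(sorted(cells), 2):
        arc_weights[(a, b)] += 1

print(len(nodes), "nodes")
for arc, w in arc_weights.most_common(3):
    print(arc, "weight", w)  # ('col0=umbrella', 'col1=Seattle') has weight 2
```

Even in this toy version, the compression the paper describes is visible: three rows of raw data collapse into a handful of nodes, with repetition captured as arc weight rather than duplicated records.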
Understanding the Make-up of Graphs Based on Connections

The graph data model is very flexible and displays a distinct topography based on the density of connections. A cross-sectional view of the graph data model will typically expose the following layers (a rough sketch of bucketing arcs into these layers appears at the end of this section):

Layer 1: Very Noisy Connections. Typically the most highly prevalent data, this layer is composed of high-volume interactions that may be mundane and blatantly obvious.
Layer 2: Highly Connected Nodes. Lying just below the noise, this second layer contains the first signal that is interesting. It exhibits distinct patterns based on crowd behavior.
Layer 3: Weaker Connections. The third layer is a weaker signal and displays the non-obvious connections. These relate to events that are less frequent and may be connected in non-obvious ways.
Layer 4: The Faint Signal. Composed of very weak connections and interactions, this last layer is of interest for security and surveillance. In many cases, it emerges only when the data is very rich in entities, causing connections to surface in very non-obvious ways.

Advantages of Emcien's Graph Data Model

The graph data model exhibits a topography that signifies relationships and connectedness in a way that is not possible through any other method, and Emcien's algorithms have been designed to surface these patterns. A few key attributes characterize the approach:

Software: A critical distinction of Emcien's graph data model is that it is software. It provides the computational engine with a data representation that lends itself to high-speed computing; as a result, the software runs on typical commodity computing environments.

Algorithmic Layer: Although some products on the market build the graph into the database layer or hardware layer, they lack an algorithmic layer and therefore require the user to query the system under the old data-inquiring paradigm. Algorithms automate the data-analysis process, which is an absolute requirement for efficient Big Data analytics.

Compact Representation of Data: Data is big because the number of events can grow exponentially as the various entities continually interact. The graph representation is ideal for Big Data because it is very compact: the number of entities grows more slowly and reaches a natural steady state, and the interactions translate to connection weights, allowing the graph model to encapsulate very big data in smaller structures.

Noise Elimination: The graph data model can be thought of in terms of layers, based on the connectedness of the data elements: the highly connected, noisy nodes sit at the top layer, while the weak connections lie buried deep in the graph. The noisy connections can be overwhelming and tend to render graph models burdensome. Emcien utilizes a suite of patented algorithms to automatically filter out the noise and detect critical patterns that relate to highly significant and relevant information.
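As a rough illustration of the layered topography, the sketch below buckets weighted arcs (like those produced in the earlier co-occurrence example) into the four layers by connection strength. Both the arcs and the thresholds are invented for illustration; the paper does not specify how Emcien's algorithms delimit the layers.

```python
# Rough illustration only: bucketing arcs into layers by weight.
# Thresholds and sample arcs are invented; the paper does not give actual cut-offs.

arc_weights = {  # hypothetical (node_a, node_b) -> co-occurrence count
    ("login", "office-IP"): 9500,       # layer 1: noisy, mundane volume
    ("purchase", "loyalty-card"): 310,  # layer 2: crowd behavior
    ("wire-transfer", "new-payee"): 12, # layer 3: less frequent, non-obvious
    ("cash-ticket", "one-way"): 2,      # layer 4: the faint signal
}

def layer(weight: int) -> int:
    if weight >= 1000:
        return 1  # very noisy connections
    if weight >= 100:
        return 2  # highly connected nodes
    if weight >= 10:
        return 3  # weaker connections
    return 4      # the faint signal

for arc, w in sorted(arc_weights.items(), key=lambda kv: -kv[1]):
    print(f"layer {layer(w)}: {arc} (weight {w})")
```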
Not Just Theory: Solving Real-World Problems

The representation and visualization of complex networks as graphs helps surface critical, time-sensitive intelligence. One of the most important tasks in graph analysis is to identify closely connected network components comprising nodes that share similar properties. Detecting communities is of significant value in retail, healthcare, banking, and intelligence work: verticals where loosely federated communities deliver insight and intelligence into the profile of a customer base or any other group being analyzed.

How Can This Model Be Applied?

Emcien's Pattern Detection Engine:
•	Intelligence: Surfaces critical correlations between people that merit serious attention, determines key individuals in targeted social networks, and geo-locates persons of interest and their networks around the world, from gangs to terrorists.
•	Network Security: Auto-detects intrusion patterns and surfaces suspicious activity by providing immediate insight into highly linked variables, then automatically identifies anomalies without the user having to query the data. For example, Emcien analyzes millions to billions of transactions to identify patterns in source and destination IP addresses, ports, days, times, and activity, showing you what you should be paying attention to. Emcien eliminates over 95% of the noise and identifies patterns that are "surprising" or that deviate from the norm.
•	Fraud Detection: Surfaces patterns in money laundering and fraud by identifying groups of customers, locations, or transaction types that occur together in banking transactions.
•	Customer Analytics: Surfaces insights on customer buying patterns, locations, demographics, loyalty, savings, lifestyle, and insurance.
•	Healthcare Analytics: Analyzes massive volumes of clinical data on medications, allergies, medical claims, pharmacy therapies, lab results, medical records, clinician notes, and more in order to surface patterns.
•	Performance and Operations Analytics: Analyzes raw information about the performance and operations of every element of an organization, which can be interpreted to increase profitability or improve customer service.

In short, Emcien tackles one of the biggest challenges of Big Data, namely "What are the right questions to ask?" Emcien's pattern-detection engine quickly discovers the value within massive data sets by making connections between disparate, seemingly unrelated bits of information and by finding the highest-ranked of these connections to focus on, which reveals time-sensitive, mission-critical insights.
Conclusion

The Big Data Analytics revolution is underway. This revolution is a historic and game-changing expansion of the role that information plays in business, government, and consumer realms. To harness the power of this data revolution, a paradigm shift is required: organizations must be able to do more than query their Big Data stores, because search is no longer enough.

Until now, the objective of queries was to find the signal in the noise. That worked because we had clear-cut business questions, the data was smaller and more complete, and we usually knew what we were looking for. We were playing in the realm of known knowns and known unknowns. In the new world of Big Data, it is more important to know what to ignore, because unless you know what to ignore, you will never get the chance to pay attention to what is really important. Using algorithms to first ignore the noise and then find the insights is the way of the new world.

Extracting insight from Big Data requires analytics methods that are fundamentally different from traditional querying, mining, and statistical analysis on small samples. Big Data is often noisy, dynamic, heterogeneous, unstructured, inter-related, and untrustworthy, and its combinatorial explosion requires new methods for finding insight. The need for this data sophistication is driven by economics and time-criticality. As stated earlier, manually laboring through thousands of queries to find the ones that deliver insight is not an efficient way to derive value from data.

Emcien's technology provides a "Command Center" for Big Data, automatically interpreting the data, discovering patterns, identifying complex and significant relationships, and surfacing the most relevant questions that lead to the insights analysts need to know.

About Emcien Corp.

Emcien's automatic pattern-detection engine converts data to actionable insight that organizations can use immediately. Emcien breaks through the time, cost, and scale barriers that limit the ability to operationalize the value of data for mission-critical applications. Our patented algorithms recognize what's important, defocus what's not, evaluate all possible combinations, and deliver the optimal results automatically. Emcien's engine, fueled by several highly competitive NSF grants and years of research at Georgia Tech and MIT, is delivering unprecedented value to organizations across sectors that depend on immediate insight for success: banking, healthcare, insurance, retail, intelligence, and others. Visit emcien.com to learn more.
Sources

1. Christopher C. Shilakes and Julie Tylman, "Enterprise Information Portals," Merrill Lynch, 16 November 1998.
2. Michael Friendly, "Milestones in the History of Thematic Cartography, Statistical Graphics, and Data Visualization," 2008.
3. J. Vascellaro, "Hadoop Has Promise but Also Problems," The Wall Street Journal, 23 February 2012.
4. R. Srinivasan, "Enterprise Hadoop: Five Issues with Hadoop That Need Addressing," blog, 28 May 2012.