Paradigm4 Research Report: Leaving Data on the table

Leaving Data on the Table
Data Scientists Reveal Obstacles to Big Data Analytics

Paradigm4 Data Scientist Survey
While Big Data enjoys widespread media coverage, not enough attention has been paid to what
practitioners think — data scientists who manage and analyze massive volumes of data.
We wanted to know, so Paradigm4 teamed up with Innovation Enterprise to ask over 100 data scientists
for their help separating Big Data hype from reality. What we learned is that data scientists face multiple
challenges achieving their company’s analytical aspirations. The upshot is that businesses are leaving data
— and money — on the table.
We asked data scientists questions such as:
What obstacles prevent them from gaining insights into their data?
How many use Hadoop, and which limitations have they encountered when attempting to use Hadoop for complex analytics?
What data types and sources would they like to leverage more effectively?
Will they adopt complex analytics solutions, and how quickly?

Their experiences should help inform businesses on what to look for as they investigate options to expand their analytics infrastructure. For insight on the issues and obstacles facing data scientists, read on.

This survey uses the terms “complex analytics” and “basic analytics,” for which respondents were given these definitions:
“Complex analytics” means math functions like covariance, clustering, machine learning, principal components analysis and graph operations.
“Basic analytics” means business intelligence reporting such as sums, counts and aggregates.

This distinction is important because basic analytics are “embarrassingly parallel” whereas complex analytics are not. Here’s what we mean. “Embarrassingly parallel” (sometimes referred to as “data parallel”) refers to problems that can be separated into multiple independent sub-problems that can run in parallel and do not require access to all the data at once. This is the divide-and-conquer approach used by MapReduce/Hadoop. In contrast, “non-embarrassingly parallel” problems require using and sharing all the data at once and communicating intermediate results among processes. Matrix multiplication on matrices too large to fit on one server is an example of a non-embarrassingly parallel function.
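The difference between embarrassingly parallel and non-embarrassingly parallel work can be sketched in a few lines of Python. This is an illustration only and not part of the survey: the function names (`split`, `chunked_sum`, `inputs_needed`) are hypothetical, and a thread pool stands in for separate servers.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values, n_chunks):
    """Split a list into at most n_chunks contiguous chunks."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def chunked_sum(values, n_chunks=4):
    """Basic analytics: each worker sums only its own chunk.

    No worker needs another worker's data, so the chunks could live on
    different servers; only one number per chunk is sent back.
    """
    with ThreadPoolExecutor() as pool:
        partial_sums = list(pool.map(sum, split(values, n_chunks)))
    return sum(partial_sums)


def inputs_needed(i, j, n_blocks):
    """Complex analytics: input blocks needed for block (i, j) of C = A @ B.

    With A and B each cut into an n_blocks x n_blocks grid, output block
    (i, j) is the sum over k of A[i][k] @ B[k][j], so it touches a whole
    block-row of A and a whole block-column of B.
    """
    return [(("A", i, k), ("B", k, j)) for k in range(n_blocks)]


total = chunked_sum(list(range(1, 101)))
needed = inputs_needed(0, 1, n_blocks=3)
```

Here `chunked_sum` gives the same total however the list is split, because the chunks never interact. By contrast, `inputs_needed(0, 1, n_blocks=3)` lists three pairs of blocks spanning a full block-row of A and a full block-column of B, which is why workers holding different pieces of a large matrix have to exchange data.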

The Big Takeaways
1. We’ve all heard how hard it is to analyze massive and rapidly growing data volumes. But data scientists say variety presents a bigger challenge. They are at times leaving data out of their analyses as they wrestle with how to integrate and analyze more types of data such as time-stamped sensor, location, image and behavioral data as well as network data.
2. Data scientists are turning to large-scale complex analytics both for unbiased, data-driven exploration and to wrest more value from their data.
3. For complex analytics, data scientists are forced to move large volumes of data from existing data stores to dedicated mathematical and statistical computing software. This time-consuming and coding-intensive step adds no analytical value and impedes productivity.
4. While Hadoop has garnered widespread media coverage, 76 percent of the data scientists who have used it have encountered serious limitations. Hadoop is well suited for embarrassingly parallel problems but falls short for large-scale complex analytics.
5. Incorporating diverse data types into analytical workflows is a major pain point for data scientists using traditional relational database software.
6. For data scientists, Big Data means Big Stress. 39 percent say it’s made their job more stressful.

Data Variety Is Proving to Be More Important Than Volume

The overwhelming volume of corporate and organizational data continues to generate headlines, but it’s the diverse types of data that pose a bigger challenge. Nearly three-quarters of data scientists (71 percent) said Big Data had made their analytics more difficult and that data variety, not just volume, was the challenge.

My Analytics Are Becoming More Difficult Because of the Variety and Types of Data Sources (Not Just the Volume)
TRUE: 71%
FALSE: 29%

What Is the Biggest Problem You Face in Gaining Insights From Your Big Data?
I struggle with managing new types and sources of data: 40%
I know how to get the answer but it takes too long (my data is too big to move to a math/analytics software package): 36%
I don’t know what questions to ask of my data: 24%
I know what I want to ask but don’t know how to get the answers: 18%
I know how to get the answer but my analysis runs out of memory: 17%

Which types of data do you anticipate using in the next year?
Time-series: 66%
Business transaction: 66%
Geospatial / Location: 55%
Graph (network): 46%
Clickstream: 35%
Health records: 25%
Sensor: 17%
Image: 13%
Genomic: 7%

What It Means:
The ability to effectively use diverse data sources is proving to be a competitive differentiator in many industries.

The trend toward hyper-personalization and precision targeting illustrates this well. Recommendations, search results and ads are becoming ever more relevant and micro-targeted as they tap more and diverse data like social networks, current location, and browsing and purchasing history. Personalized insurance offerings are augmenting sensor data about driver behavior to incorporate contextual data like time of day and road congestion. Precision medicine providers are gaining a more refined understanding of what works for whom by integrating molecular data with clinical, behavioral, electronic health records and environmental data. But the ability to use diverse data types poses a serious challenge. (For more on this topic, see “Big Data at Work: Dispelling the Myths, Uncovering the Opportunities,” by Thomas Davenport, Chapter 1: “Why Big Data is Important to you and your Organization.”)

Data Scientists Are Turning to Complex Analytics to Analyze Their Big Data

When will your company begin to use complex analytics on your Big Data?
We use it now: 59%
We plan to use it in the next year: 15%
In the next 2 years: 16%
In the next 3 years: 1%
More than 3 years down the road: 4%
No plans to use complex analytics: 4%
“The point is not to be dazzled by the volume of data, but rather to analyze it — to convert it into insights, innovations, and business value.”
Thomas Davenport, “Big Data at Work: Dispelling the Myths, Uncovering the Opportunities,” page 2.

What It Means:
The “low hanging fruit” of Big Data has been exploited.

Many new analytical uses require significantly more powerful algorithms and computational approaches than what’s possible in Hadoop or relational databases. Data scientists increasingly need to leverage all data sources in novel ways, using tools and analytical infrastructures suitable for the task. As we have already seen in this survey, organizations are moving from simple SQL aggregates and summary statistics to next-generation analytics such as machine learning, clustering, correlation, and principal components analysis on moderately sized data sets. The move from simple to complex analytics on Big Data presages an emerging need for analytics that scale beyond single-server memory limits and handle sparsity, missing values and mixed sampling frequencies appropriately. These complex analytics methods can also provide data scientists with unsupervised and assumption-free approaches, letting all the data speak for itself.

Moving Big Data Poses Difficult Challenges to Data Scientists

Data scientists face another growing challenge: conventional analytic workflows require them to move data to mathematical and statistical computing software. This workflow made sense with small or sampled data but is either woefully inefficient or breaks with even moderately large data volumes.

78% of data scientists utilize software capable of complex analytics in addition to their data management software
36% of data scientists say it takes too long to get insights from their data because it is too big to move to their analytics software

What It Means:
The size and diversity of today’s data sets pose a significant hurdle to doing more sophisticated analytics because so much time is lost moving data from files or from a database to analysis tools.

This forces data scientists to make compromises, analyzing samples instead of the whole data set and leaving data and money on the table. Data scientists risk missing rare events, weak signals or important anomalies when restricted to working with samples or computing on subsets independently. (For more on this topic, see “Scaling Big Data Mining Infrastructure: The Twitter Experience,” by Twitter Engineering Manager Dmitriy Ryaboy and University of Maryland Associate Professor Jimmy Lin.) What’s needed are tools capable of conducting complex analytics over massive data volumes efficiently, without sampling and without moving the data.
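The risk that sampling misses rare events can be put into numbers with a short calculation. The sketch below is illustrative only: the function name `prob_sample_misses_all` and the figures (one million rows, five rare events, a 1 percent sample) are hypothetical and do not come from the survey. It assumes a simple random sample drawn without replacement.

```python
def prob_sample_misses_all(total_rows, rare_rows, sample_rows):
    """Probability that a simple random sample drawn without replacement
    contains none of the rare rows (hypergeometric probability of zero hits).
    """
    p = 1.0
    for i in range(rare_rows):
        unsampled = total_rows - sample_rows - i
        if unsampled <= 0:
            # The sample is so large that at least one rare row must be in it.
            return 0.0
        # Chance that the (i + 1)-th rare row also falls outside the sample.
        p *= unsampled / (total_rows - i)
    return p


# Hypothetical figures: 1,000,000 rows, 5 rare events, a 1 percent sample.
p_miss = prob_sample_misses_all(1_000_000, 5, 10_000)

# Analyzing the whole data set cannot miss them.
p_full = prob_sample_misses_all(1_000_000, 5, 1_000_000)
```

Under these assumed numbers, the 1 percent sample contains none of the five rare events roughly 95 percent of the time, while an analysis over the full data set always includes them.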

Hadoop Only Takes You So Far

While the Hadoop software platform garners significant media attention, Hadoop is not a viable solution for many use cases, especially those that require complex analytics. Fewer than half of data scientists surveyed (48 percent) have used Hadoop or Spark, and of those, 76 percent cited significant limitations to their use.

From the 76% reporting problems, what are the limitations of Hadoop / Spark?
It takes too much effort to program: 39%
It’s too slow for interactive, ad hoc queries: 37%
It’s too slow for real-time analytics: 30%
It’s not well-suited for my analytics (not embarrassingly parallel): 22%

35% of data scientists who tried Hadoop or Spark have stopped using it

What It Means:
Hadoop was unrealistically hyped as a universal and disruptive Big Data solution.

But even Hadoop vendors have recognized the limitations. They are adding SQL functionality to their products to accommodate data scientists’ preference for a higher-level query language instead of programming languages like Java and to address the limitations of MapReduce. (E.g., Cloudera has abandoned MapReduce and is offering Impala to provide SQL on HDFS.) A growing number of complex analytics use cases are proving to be unworkable in Hadoop. First-wave Hadoop adopters like Google, Facebook and LinkedIn required a small army of developers to program and maintain Hadoop. But many organizations either don’t have the required staff or face complex analytics challenges that can’t be readily solved with Hadoop. This presents a real challenge for the Hadoop infrastructure, which has to address these shortcomings or risk being replaced.

Existing relational database management systems are inadequate for analyzing the variety of data sources

Given the growing diversification of data types and sources, coupled with the limitations of existing relational databases, it’s no surprise that many data scientists are frustrated when they try to use these data sources in their analytical workflows.

I am finding it harder to fit my data into relational database tables
TRUE: 49%
FALSE: 51%

What It Means:
Relational databases were built for storing and querying densely populated transactional data such as business purchases and customer information.

By comparison, temporal, spatial and network data may be quite sparse (containing large amounts of missing values), have mixed sampling frequencies and a natural order. Relational databases require predefined access patterns for each line of inquiry, an obvious non-starter for data scientists doing ad hoc data exploration.
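A small example shows how mixed sampling frequencies produce sparse data once readings are forced into one densely populated table. This is a sketch under stated assumptions, not something taken from the survey: the two sensors, their sampling intervals and the helper names `dense_table` and `missing_fraction` are all hypothetical.

```python
def dense_table(series_by_sensor):
    """Align all sensors on the union of their timestamps, one column per
    sensor. Cells with no reading at that timestamp are None."""
    timestamps = sorted({t for readings in series_by_sensor.values() for t in readings})
    sensors = sorted(series_by_sensor)
    rows = [[t] + [series_by_sensor[s].get(t) for s in sensors] for t in timestamps]
    return sensors, rows


def missing_fraction(rows):
    """Fraction of value cells (excluding the timestamp column) that are None."""
    cells = [value for row in rows for value in row[1:]]
    if not cells:
        return 0.0
    return sum(value is None for value in cells) / len(cells)


# Hypothetical readings over one minute:
# temperature is sampled every second, pressure every 10 seconds.
readings = {
    "temperature": {t: 20.0 + 0.1 * t for t in range(0, 60)},
    "pressure": {t: 1013.0 for t in range(0, 60, 10)},
}

sensors, rows = dense_table(readings)
frac = missing_fraction(rows)
stored = sum(len(series) for series in readings.values())
```

With these assumed inputs, the dense table has 60 rows and 120 value cells, but only 66 readings actually exist, so 45 percent of the cells are empty. Adding more sensors with different sampling rates makes the table sparser still.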

Big Data Means Big Stress for Data Scientists

There’s another side of the Big Data story: 39 percent of data scientists say their job has become more stressful with the growth of Big Data. That’s nearly four times the number who say it’s made their job less stressful.

39% of data scientists say the growth of Big Data has made their job more stressful in the last year
24% say they don’t know which questions to ask of their Big Data

Quotes from data scientists:
“My biggest problem is linking various data sources.”
“The data is just too big.”
“The biggest problem is putting multiple sources of data together.”

What It Means:
Driven in part by media hype, organizations have developed inflated expectations around the value they’ll get out of Big Data.

Fulfilling those expectations falls on the data scientist. But outdated software approaches, better suited to traditional transactional data than to today’s diverse data sources and rapidly growing volumes, often make it impossible to fulfill these expectations. It’s a recipe for stress. Deriving business value from organizational data starts with ad hoc analysis. Tools and workflows need to enable data scientists to conduct analysis quickly and efficiently, making data scientists more productive and lowering stress levels as a result.

So What?
Data scientists play a pivotal role helping organizations unlock the potential of their Big Data. But current software tools fall short in some areas, as indicated in the survey. Hype has exceeded reality and data scientists are forced to compromise, sometimes leaving data on the table. Choosing the right software solution is key, but don’t expect to get there by browsing vendors’ websites. The fact that so many data scientists identified shortcomings in their infrastructure suggests that the only way to tell which solution is best suited to your organization is to do a pilot project using your data and your use cases.

About the Survey
The Paradigm4 Data Scientist Survey was fielded by Innovation Enterprise, an independent research firm, from March 27 to April 23, 2014. The responses were generated from a survey of 111 data scientists in the U.S.

About Paradigm4
Paradigm4 is the creator of SciDB, a computational database management system used to solve large-scale, complex analytics challenges on Big and Diverse Data. Led by industry visionaries and veterans Michael Stonebraker, Marilyn Matz, Paul Brown and Bryan Lewis, Paradigm4 enables data-obsessed organizations in life sciences, e-commerce, finance, and manufacturing to answer harder questions faster.
For more information, visit www.paradigm4.com
