There has been an explosion in database technology designed to handle big data and deep analytics from both established vendors and startups. This session will provide a quick tour of the primary technology innovations and systems powering the analytic database landscape—from data warehousing appliances and columnar databases to massively parallel processing and in-memory technology. The goal is to help you understand the strengths and limitations of these alternatives and how they are evolving so you can select technology that is best suited to your organization and needs.
Presentation from the O'Reilly Strata conference, February 2011
1. Determine the Right Analytic Database: A Survey of New Data Technologies
O’Reilly Strata Conference
February 1, 2011
Mark R. Madsen
http://ThirdNature.net
Twitter: @markmadsen
Atomic Avenue #1 by Glen Orbik
8. Technology Has Changed (a lot) But We Haven’t
[Chart: calculations per second per $1,000, plotted on a log scale (10^-6 to 10^10) across 1900–2000, spanning the mechanical, relay, vacuum-tube, transistor, and integrated-circuit eras. Current DW architecture and methods start in the mid-1980s; roughly a 10,000x improvement since then. Data: Ray Kurzweil, 2001.]
9. Moore’s Law via the Lens of the Industry Analyst
[Chart: CPU speed increasing steadily over time.]
14. Problem: linear extrapolation
“If the automobile had followed the same development as the computer, a Rolls-Royce would today cost $100, get a million miles per gallon, and explode once a year killing everyone inside.”
Robert Cringely
[Chart: a linear extrapolation of “anything” over time diverges from reality.]
24. In‐Memory Processing
1. Maybe not as fast as you think. Depends entirely on
the database (e.g. VectorWise)
2. So far, applied mainly to shared‐nothing models
3. Very large memories are more applicable to
shared‐nothing than shared‐memory systems:
shared‐memory is box‐limited (e.g. 2 TB max), while
shared‐nothing is limited by node scaling (e.g. 16
nodes at 512 GB each = 8 TB)
4. Still an expensive way to get performance
25. Columnar Databases
In a row-store model, these three rows would be stored in sequential order, packed into a block:

ID | Name          | Salary
1  | Marge Inovera | $50,000
2  | Anita Bath    | $120,000
3  | Nadia Geddit  | $36,000

In a column-store model, the same rows would be divided by column and stored in different blocks (one block of IDs, one of names, one of salaries).
This is not just a change to the storage layout; it also involves changes to the execution engine and query optimizer.
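The row/column split in the table above can be sketched in Python. This is an illustrative layout only, using the example rows from the slide; real column stores add compression, encoding, and block management on top of it.

```python
# The three example rows from the slide.
rows = [
    (1, "Marge Inovera", 50_000),
    (2, "Anita Bath", 120_000),
    (3, "Nadia Geddit", 36_000),
]

# Row store: each record's fields are packed together in one block.
row_blocks = [record for record in rows]

# Column store: each column's values are packed together in their own block.
col_blocks = {
    "id":     [r[0] for r in rows],
    "name":   [r[1] for r in rows],
    "salary": [r[2] for r in rows],
}

# An analytic query like SUM(salary) only has to read the salary block,
# instead of scanning every full row.
total_salary = sum(col_blocks["salary"])
```

This is why columnar layouts favor scan-heavy analytic queries: a query touching one column of a wide table reads a fraction of the data a row store would.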
28. Explosion of Analytic Techniques
Advanced analytic methods now span:
• Machine learning
• Statistics
• Visualization
• GIS
• Information theory & IR
• Numerical methods
• Rules engines & constraint programming
• Text mining & text analytics
32. How Hadoop fits into a traditional BI environment
[Diagram: source environments (databases, documents, flat files, XML, queues, ERP applications) feed the data warehouse via ETL and Hadoop via file loads; developers work with development tools and IDEs, analysts with analysis tools and BI, and end users with BI and applications.]
33. NoSQL theoretically = “not only sql”, in reality…
Data stores that augment or replace relational access
and storage models with other methods.
Different storage models:
• Key‐value stores
• Column families
• Object / document stores
• Graphs
Different access models:
• SQL (rarely)
• programming API
• get/put
Reality: mostly suck for BI & analytics
Analytic DB vendors are coming from the other direction:
• Aster Data – SQL wrapped around MR
• EMC (Greenplum) – MR on top of the database
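The get/put access model named above can be sketched minimally in Python. The class and key names here are illustrative, not any specific product's API:

```python
# A minimal key-value store exposing the get/put access model.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store a value under a key."""
        self._data[key] = value

    def get(self, key, default=None):
        """Fetch a value by key; no query language, no joins."""
        return self._data.get(key, default)

store = KeyValueStore()
store.put("user:42", {"name": "Anita Bath", "segment": "premium"})
user = store.get("user:42")
```

The sketch also shows why the "mostly suck for BI & analytics" verdict holds: with only get/put, any ad hoc aggregate or join has to be written as application code that scans keys, which is exactly the work a SQL engine does for you.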
34. Some realities to consider
Cheap performance?
▪ Do you have 20 blades
lying around unused?
▪ How much concurrency?
▪ How much effort to write
queries? Debug them?
▪ Performance comparisons:
10x slower on the same
hardware?
The key is the workload type
and the scale of it.
35. Do you really need a rack of blades for computing?
Graphics co‐processors have
been used for certain problems
for years.
Offer single‐system solution to
offload very large compute‐
intensive problems.
With current technology they offer an order-of-magnitude cost reduction and an order-of-magnitude performance increase (for compute-intensive problems).
We’ve barely started with this.
36. Other Options for analytic software deployment
The basic models.
1. Separate tools and systems
(MapReduce and NoSQL are a
simple variation on this theme)
2. Integrated with a database
3. Embedded in a database
The primary arguments about
deployment models center on
whether to take data to the
code or code to the data.
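The data-to-code versus code-to-data distinction can be sketched with SQLite (a stand-in here for any SQL database; the table and column names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# Data to code: pull every row out and compute in the application.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
app_totals = {}
for region, amount in rows:
    app_totals[region] = app_totals.get(region, 0.0) + amount

# Code to data: push the computation into the database engine,
# so only the small result set crosses the wire.
db_totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
```

Both paths produce the same totals; the difference is where the work runs and how much data moves, which is why the choice matters more as volumes grow.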
37. Leveraging the Database
Levels of database integration:
▪ Native DB connector
▪ External integration
▪ Internal integration
▪ Embedded
+ Less data movement
+ Possible dev process support
+ Hardware / environment
savings
+ Possible “sandboxing” support
‐ Limitations on techniques
39. What are factors in the decision?
User concurrency: one job or many
Repetition is a key element:
▪ Execute once and apply (build a response
or mortality model)
▪ Many executions daily (web cross‐sells)
In‐process or Batch?
▪ Batch and use results – segment, score
▪ In‐process reacts on demand – detect
fraud, recommend
In‐process requires thinking about how it
integrates with the calling application. (SQL is
sometimes not your friend.)
45. Hardware Architectures and Deployment
Compute and data sizes are the key requirements
[Chart: compute requirements (megaflops through petaflops) vs. data volume (<10s of GB through PB). Shared-everything PC or shared-disk systems cover the low end of both axes, shared-nothing systems the middle, and MR and related computations the high end.]
48. The real question: why do you want a new platform?
Trouble doing what you already do today
▪ Poor response times
▪ Not meeting availability deadlines
Doing more of what you do today
▪ Adding users, mining more data
Doing something new with your data
▪ Data mining, recommendations, embedded real‐time
process support
What’s desired is possible but limited by the cost of
supporting or growing the existing environment.
50. The assumption of the warehouse as a database is gone
                                         | Data at rest                   | Data in motion
Non-traditional data (logs, audio, docs) | Parallel programming platforms | Message streams
Traditional tabular or structured data   | Databases                      | Streaming DBs/engines
Copyright Third Nature, Inc.
53. What’s it going to cost? A small sample at list:
Solution                   | Pricing model        | Price/unit             | 1 TB solution | Remarks
Dataupia                   | Node                 | $19,500/2TB            | $19,500       | You can’t buy a 1 TB Satori server
Kickfire (out of business) | Data volume (raw)    | $50,000/TB             | $50,000       | Includes MySQL 5.1 Enterprise
Vertica                    | Data volume (raw)    | $100,000/TB            | $200,000      | Based on 5 nodes, $20,000 each
ParAccel                   | Data volume (raw)    | $100,000/TB            | $200,000      | Based on 5 nodes, $20,000 each
EXASOL                     | Data volume (active) | $1,350/GB (€1,000/GB)  | $350,000*     | Based on 4 nodes, $20,000 each
Teradata                   | Node                 | $99,000/TB             | $99,000**     | Based on 2550 base configuration

* 1 TB raw ≈ 200 GB active; ** realistic configuration likely 2x this price
55. The Path to Performance
1. Laborware – tuning
2. Upgrade – try to solve the
problem without changing
out the database
3. Extend – add an ADB or
Hadoop cluster to the
environment to offload a
specific workload
4. Replace – out with the old,
in with the new
61. About the Presenter
Mark Madsen is president of Third
Nature, a technology research and
consulting firm focused on business
intelligence, analytics and
performance management. Mark is
an award-winning author, architect
and former CTO whose work has
been featured in numerous industry
publications. During his career Mark
received awards from the American
Productivity & Quality Center, TDWI,
Computerworld and the Smithsonian
Institution. He is an international
speaker, contributing editor at
Intelligent Enterprise, and manages
the open source channel at the
Business Intelligence Network. For
more information or to contact Mark,
visit http://ThirdNature.net.
62. About Third Nature
Third Nature is a research and consulting firm focused on new and
emerging technology and practices in business intelligence, data
integration and information management. If your question is related to BI,
open source, web 2.0 or data integration then you’re at the right place.
Our goal is to help companies take advantage of information-driven
management practices and applications. We offer education, consulting
and research services to support business and IT organizations as well as
technology vendors.
We fill the gap between what the industry analyst firms cover and what IT
needs. We specialize in product and technology analysis, so we look at
emerging technologies and markets, evaluating the products rather than
vendor market positions.