One Size Doesn't Fit All: The New Database Revolution

One Size Doesn’t Fit All:
The New Database Revolution

Mark Madsen & Robin Bloor

Your Host

Eric.kavanagh@bloorgroup.com

Analysts Host

Bloor Madsen

Introduction
Significant and revolutionary changes are taking place
in database technology

In order to investigate and analyze these changes and
where they may lead, The Bloor Group has teamed up
with Third Nature to launch an Open Research
project.

This is the final webinar in a series of webinars and
research activities that have comprised part of the
project

All published research will be made available through
our web site: Databaserevolution.com

General Webinar Structure
Market Changes, Database Changes (Some Of The
Findings)

Workloads, Characteristics, Parameters

A General Discussion of Performance

How to Select A Database

Market Changes, Database
Changes

Database Performance Bottlenecks
CPU saturation

Memory saturation

Disk I/O channel saturation

Locking

Network saturation

Parallelism – inefficient load balancing

Cloud Hardware Architecture
• It’s a scale-out model. Uniform virtual node building
blocks.

• This is the future of software deployments, albeit with
increasing node sizes, so paying attention to early
adopters today will pay off.

• This implies that an MPP database architecture will be
needed for scale.

X

Multiple Database Roles

Now there are more...

Let’s Stop Using the Term NoSQL

As the graph indicates,
it’s just not helpful. In
fact it’s downright
confusing.

NoSQL Directions
Some NDBMS do not attempt to provide all ACID properties.
(Atomicity, Consistency, Isolation, Durability)

Some NDBMS deploy a distributed scale-out architecture with data
redundancy.

XML DBMS using XQuery are NDBMS.

Some documents stores are NDBMS (OrientDB, Terrastore, etc.)

Object databases are NDBMS (Gemstone, Objectivity, ObjectStore, etc.)

Key value stores = schema-less stores (Cassandra, MongoDB, Berkeley
DB, etc.)

Graph DBMS (DEX, OrientDB, etc.) are NDMBS

Large data pools (BigTable, Hbase, Mnesia, etc.) are NDBMS

The Joys of SQL?
SQL: very good for set manipulation.
Works for OLTP and many query
environments.
Not good for nested data structures
(documents, web pages, etc.)
Not good for ordered data sets
Not good for data graphs (networks of
values)

The “Impedance Mismatch”
The RDBMS stores data organized
according to table structures

The OO programmer manipulates data
organized according to complex object
structures, which may have specific
methods associated with them.

The data does not simply map to the
structure it has within the database

Consequently a mapping activity is
necessary to get and put data

Basically: hierarchies, types, result sets,
crappy APIs, language bindings, tools

The SQL Barrier
SQL has:
DDL (for data definition)
DML (for Select, Project and Join)
But it has no MML (Math) or TML
(Time)

Usually result sets are brought to
the client for further analytical
manipulation, but this creates
problems

Alternatively doing all analytical
manipulation in the database
creates problems

Hadoop/MapReduce
Hadoop is a parallel
processing environment

Map/Reduce is a parallel
processing framework

Hbase turns Hadoop
into a database of a kind

Hive adds an SQL
capability

Pig adds analytics

Market Forces
A new set of products appear

They include some fundamental innovations

A few are sufficiently popular to last

Fashion and marketing drive greater adoption

Products defects begin to be addressed

They eventually challenge the dominant products

Market forces affecting database choice
Performance: trouble doing what you already do today
▪ Poor response times
▪ Not meeting data availability requirements
Scalability: doing more of what you do today
▪ Adding users, processing more data
Capability: doing something new with your data
▪ Data mining, recommendations, real‐time
Cost or complexity: working more efficiently
▪ Consolidating / rehosting to simplify and reduce cost
What’s desired is possible but limited by the cost of
growing and supporting the existing environment.
Page 20

Relational has a good conceptual model, but a
prematurely standardized implementation

The relational database is the franchise technology for storing and
retrieving data, but…
1. Global, static schema model
2. No rich typing system
3. No concept of ordering, creating challenges with e.g. time series
4. Many are not a good fit for network parallel computing, aka cloud
5. Limited API in atomic SQL statement syntax & simple result set return
6. Poor developer support

Big data?

Unstructured data isn’t
really unstructured.
The problem is that this
data is unmodeled.
The real challenge is
complexity.

Text, Objects and Data Don’t Always Fit Together

So this is what
they meant by
“impedance
mismatch”

Many new choices, one way to look at them

http://blog.nahurst.com/visual-guide-to-nosql-systems

What About Analytics?

Machine
learning
Visualization Statistics
GIS

Advanced
Analytic
Information Methods Numerical
theory & IR methods

Rules Text mining
engines & & text
constraint analytics
programming

The holy grail of databases under current market hype

A key problem is that we’re
talking mostly about
computation over data when we
talk about “big data” and
analytics, a potential mismatch
for both relational and nosql.

Technologies are not
perfect replacements for
one another.

When replacing the old
with the new (or ignoring
the new over the old) you
always make tradeoffs,
and usually you won’t see
them for a long time.

Scalability and performance are not the same thing

Performance measures
Throughput: the number of
tasks completed in a given
time period
A measure of how much
work is or can be done by a
system in a set amount of
time, e.g. TPM or data
loaded per hour.
It’s easy to increase
throughput without
improving response time.
Page 29

Performance measures

Response time: the speed
of a single task
Response time is usually
the measure of an
individual's experience
using a system.
Response time =
time interval / throughput

Page 30

Scalability vs throughput vs response time

Scalability = consistent performance for a task over an
increase in a scale factor

Scale: Data Volume
The different ways people count
make establishing rules of thumb
for sizing hard.
How do you measure it?
▪ Row counts
▪ Transaction counts
▪ Data size
▪ Raw data vs loaded data
▪ Schema objects

People still have trouble scaling for
databases as large as a single PC
hard drive.

Scale: Concurrency (active and passive)

Scalability relationships
As concurrency
increases, response time
(usually) decreases,
This can be addressed
somewhat via workload
management tools.
When a system hits a
bottleneck, response
time and throughput will
often get worse, not just
level off.

Scale: Computational Complexity

A key point worth remembering:

Performance over size <> performance over complexity

Analytics performance is about the intersection of both.
Database performance for BI is mostly related to size and
query complexity.

Size, computational complexity and concurrency are the three
dimensions that constrain a product’s performance. Workloads fall
somewhere along all three.

Solving Your Problem Depends on the Diagnosis

Three General Workloads
Online Transaction Processing
▪ Read, write, update
▪ User concurrency is the common performance limiter
▪ Low data, compute complexity
Business Intelligence / Data warehousing
▪ Assumed to be read‐only, but really read heavy, write heavy,
usually separated in time
▪ Data size is the common performance limiter
▪ High data complexity, low compute complexity
Analytics
▪ Read, write
▪ Data size and complexity of algorithm are the limiters
▪ Moderate data , high compute complexity

Types of workloads
Write‐biased: Read‐biased:
▪ OLTP ▪ Query
▪ OLTP, batch ▪ Query, simple retrieval
▪ OLTP, lite ▪ Query, complex
▪ Object persistence ▪ Query‐hierarchical /
▪ Data ingest, batch object / network
▪ Data ingest, real‐time ▪ Analytic

Mixed
Inline analytic execution, operational BI

Technology choice depends
on workload & need
Optimizing for:
▪ Response time?
▪ Throughput?
▪ both?
Concerned about rapid growth
in data?
Unpredictable spikes in use?
Extremely low latency (in or
out) requirements?
Bulk loads or incremental
inserts and/or updates?

Important workload parameters to know
• Read‐intensive vs. write‐intensive

• Mutable vs. immutable data

• Immediate vs. eventual consistency

• Short vs. long access latency

• Short vs. long data latency
• Predictable vs. unpredictable data access patterns

• Short vs. long data latency
• Predictable vs. unpredictable data access patterns
• Simple vs. complex data types

You must understand your
workload mix ‐ throughput
and response time
requirements aren’t enough.
▪ 100 simple queries accessing
month‐to‐date data
▪ 90 simple queries accessing
month‐to‐date data and 10
complex queries using two
years of history
▪ Hazard calculation for the
entire customer master
▪ Performance problems are
rarely due to a single factor.

Selectivity and number of columns queried
Row store or column store, indexed or not?

Chart from “The Mimicking Octopus: Towards a one-size-fits-all Database Architecture”, Alekh Jindal

Characteristics of query workloads

Workload Selectivity Retrieval Repetition Complexity
Reporting / BI Moderate Low Moderate Moderate
Dashboards / Moderate Low High Low
scorecards
Ad‐hoc query and Low to Moderate Low Low to
analysis high to low moderate
Analytics (batch) Low High Low to High Low*
Analytics (inline) High Low High Low*
Operational / High Low High Low
embedded BI

* Low for retrieving the data, high if doing analytics in SQL

Characteristics of read‐write workloads

Workload Selectivity Retrieval Repetition Complexity
Online OLTP High Low High Low
Batch OLTP Moderate to Moderate High Moderate to
low to high high
Object High Low High Low
persistence
Bulk ingest Low (write) n/a High Low
Realtime ingest High (write) n/a High Low

With ingest workloads we’re dealing with write-only, so selectivity and
retrieval don’t apply in the same way, instead it’s write volume.

Workload parameters and DB types at data scale
Workload Write‐ Read‐ Updateable Eventual Un‐ Compute
parameters biased biased data consistency predictable intensive
ok? query path
Standard
RDBMS
Parallel
RDBMS
NoSQL (kv,
dht, obj)
Hadoop*

Streaming
database

You see the problem: it’s an
intersection of multiple

Problem: Architecture Can Define Options

A general rule for the read‐write axes

As workloads increase in both
intensity and complexity, we move
into a realm of specialized databases
adapted to specific workloads.

NewSQL

Read intensity

NoSQL
OldSQL

Write intensity

In general…
Relational row store databases for conventionally tooled
low to mid‐scale OLTP
Relational databases for ACID requirements
Parallel databases (row or column) for unpredictable or
variable query workloads
Specialized databases for complex data query workjloads
NoSQL (KVS, DHT) for high scale OLTP
NoSQL (KVS, DHT) for low latency read‐mostly data access
Parallel databases (row or column) for analytic workloads
over tabular data
NoSQL / Hadoop for batch analytic workloads over large
data volumes

How To Select A Database - (1)
1. What are the data management requirements and policies (if any) in respect
of:
- Data security (including regulatory requirements)?
- Data cleansing?
- Data governance?
- Deployment of solutions in the cloud?
- If a deployment environment is mandated, what are its technical
characteristics and limitations? Best of breed, no standards for
anything, “polyglot persistence” = silos on steroids, data integration
challenges, shifting data movement architectures
2. What kind of data will be stored and used?
- Is it structured or unstructured?
- Is it likely to be one big table or many tables?

3. What are the data volumes expected to be?
- What is the expected daily ingest rate?
- What will the data retention/archiving policy be?
- How big do we expect the database to grow to? (estimate a range).
4. What are the applications that will use the database?
- Estimate by user numbers and transaction numbers
- Roughly classify transactions as OLTP, short query, long query, long
query with analytics.
- What are the expectations in respect of growth of usage (per user) and
growth of user population?
5. What are the expected service levels?
- Classify according to availability service levels
- Classify according to response time service levels
- Classify on throughput where appropriate

6. What is the budget for this project and what does that cover?
7. What is the outline project plan?
- Timescales
- Delivery of benefits
- When are costs incurred?
8. Who will make up the project team?
- Internal staff
- External consultants
- Vendor consultants
9. What is the policy in respect of external support, possibly including vendor
consultancy for the early stages of the project?

10.What are the business benefits?
- Which ones can be quantified financially?
- Which ones can only be guessed at (financially)?
- Are there opportunity costs?

A random selection of databases
Sybase IQ, ASE EnterpriseDB Algebraix
Teradata, Aster Data LucidDB Intersystems Caché
Oracle, RAC Vectorwise Streambase
Microsoft SQLServer, PDW MonetDB SQLStream
IBM DB2s, Netezza Exasol Coral8
Paraccel Illuminate Ingres
Kognitio Vertica Postgres
EMC/Greenplum InfiniDB Cassandra
Oracle Exadata 1010 Data CouchDB
SAP HANA SAND Mongo
Infobright Endeca Hbase
MySQL Xtreme Data Redis
MarkLogic IMS RainStor
Tokyo Cabinet Hive Scalaris
And a few hundred more…

Product Selection
Preliminary investigation

Short-list (usually arrived at by elimination)

Be sure to set the goals and control the process.

Evaluation by technical analysis and modeling

Evaluation by proof of concept.

Do not be afraid to change your mind

Negotiation

Conclusion
Wherein all is revealed, or ignorance exposed

One Size Doesn't Fit All: The New Database Revolution

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Destaque

Destaque (20)

Semelhante a One Size Doesn't Fit All: The New Database Revolution

Semelhante a One Size Doesn't Fit All: The New Database Revolution (20)

Mais de mark madsen

Mais de mark madsen (20)

Último

Último (20)

One Size Doesn't Fit All: The New Database Revolution