Finding the Right Data Solution for Your Application in the Data Storage Haystack
Srinath Perera Ph.D.
Senior Software Architect, WSO2 Inc.
Visiting Faculty, University of Moratuwa
Research Scientist, Lanka Software Foundation

Data Models
§  There have been many data models proposed (read Stonebraker's "What Goes Around Comes Around" for more details)
o  Hierarchical (IMS): late 1960s and 1970s
o  Directed graph (CODASYL): 1970s
o  Relational: 1970s and early 1980s
o  Entity-Relationship: 1970s
o  Extended Relational: 1980s
o  Semantic: late 1970s and 1980s
§  For the last 20-30 years, relational database systems (SQL) together with transactions have been the de facto data solution.
Copyright Greg Morss and licensed for reuse under CC License , http://www.geograph.org.uk/photo/990700

For many years, the choice of data storage was an easy one (use an RDBMS)
Copyright by Alan Murray Walsh and licensed for reuse under CC License , http://www.geograph.org.uk/photo/1652880

Scale of Systems
§  However, the scale of systems is changing due to
o  Increasing user bases of systems
o  Mobile devices, online presence
o  Cloud computing and multicore systems
§  Scaling up an RDBMS
o  Put it on a bigger machine.
o  Replicate (cluster) the database to 2-3 more nodes. But this approach does not scale beyond a few nodes.
o  Partition the data across many nodes (distribute, a.k.a. sharding). However, JOIN queries across many nodes are hard, and sometimes too slow. This often needs custom code and configuration. Transactions do not scale well either.
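
The partitioning option above can be illustrated with a minimal sketch, assuming simple hash partitioning. The node names, the `users` and `orders` tables, and the helper functions are hypothetical and chosen only for illustration; real databases use their own partitioning schemes. The sketch counts how many orders end up on a different node than the user they refer to, which is the situation that makes a cross-node JOIN expensive.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical node names


def node_for(key: str) -> str:
    """Pick the node that owns a key by hashing it (stable across runs)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


def partition(rows, key_field):
    """Group rows by the node that owns each row's partition key."""
    shards = {node: [] for node in NODES}
    for row in rows:
        shards[node_for(str(row[key_field]))].append(row)
    return shards


# Illustrative data: users are partitioned by user_id, orders by order_id.
users = [{"user_id": i, "name": f"user{i}"} for i in range(8)]
orders = [{"order_id": i, "user_id": i % 8} for i in range(100, 116)]

user_shards = partition(users, "user_id")
order_shards = partition(orders, "order_id")


def join_needs_remote_fetch(order) -> bool:
    """True when the order and its user live on different nodes."""
    return node_for(str(order["order_id"])) != node_for(str(order["user_id"]))


remote = sum(join_needs_remote_fetch(o) for o in orders)
print(f"{remote} of {len(orders)} orders need a cross-node lookup to join with users")
```

Partitioning both tables by `user_id` instead would co-locate each order with its user, which is one reason partitioned deployments often need application-specific design.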

Copyright digitalART2 and licensed for reuse under CC License , http://www.flickr.com/photos/digitalart/2101765353/

CAP Theorem, Transactions, and Storage
§  The RDBMS model provides two things
o  Relational model with SQL
o  ACID transactions (Atomicity, Consistency, Isolation, Durability)
§  It was a classic one-size-fits-all solution, and it worked for quite some time.
§  However, the CAP theorem says that you cannot have it all.
o  Consistency, Availability and Partition Tolerance: pick two!

§  But there are many use cases that do not need all RDBMS features; when those features are dropped, systems can scale (e.g. Google Bigtable).
§  However, to use such systems, one has to understand and exploit the application-specific behavior.
Copyright stephcarter and licensed for reuse under CC License , http://www.flickr.com/photos/stephcarter/541464462

NoSQL and other Storage Systems
§  Large internet companies hit the problem first. They built systems specific to their problems, and those systems did scale.
o  Google Bigtable
o  Amazon Dynamo
§  Soon many others followed, and most of them are free and open source.
§  Now there are a couple of dozen such systems.
§  Among the advantages of NoSQL are
o  Scalability
o  Flexible schema
o  Designed to scale and support fault tolerance out of the box

Copyright ind{yeah} and licensed for reuse under CC License ,
http://www.flickr.com/photos/flickcoolpix/3566848458/

However, with NoSQL solutions, choosing a data store is no longer simple.
Copyright Philipp Salzgeber and licensed for reuse under CC License http://www.salzgeber.at/astro/pics/20081126_heart/index.html

Selecting the Right Data Solution

§  What are the right questions to ask?
§  Categorize the answers for each question
§  Take different cases based on the different answers and make recommendations!

Copyright by Krzysztof Poltorak, and licensed for reuse under CC License.
http://www.fotocommunity.com/pc/pc/display/22077920

What are the right Questions?
o  Types of data
-  Structured, Semi-Structured,
Unstructured
o  Need for Scalability
-  Number of users
-  Number of data items
-  Size of files
-  Read/Write ratio
o  Types of Queries
-  Retrieve by Key
-  WHERE clauses
-  JOIN queries
-  Offline Queries
o  Consistency
-  Loose Consistency
-  Single Operation Consistency
-  Transactions

Copyright by romainguy, and licensed for reuse under CC License http://www.flickr.com/photos/romainguy/249370084

Unstructured Data
§  The data does not have a particular structure and is often retrieved through a key (name).
o  E.g. file systems.
§  Humans are good at processing unstructured data, but computers are not.

§  Such data is often kept in storage but consumed by humans at the end of the pipeline (e.g. a document repository).
§  One common use case is building structured data from unstructured data.
§  Metadata is often associated with the data to help searching.

Copyright Martyn Gorman and licensed for reuse under CC License, http://www.geograph.org.uk/photo/294134

Structured Data
§  Has a structure, which is often described through a schema
§  Often a table-like 2D structure is used, but other structures are also possible.
§  Main advantage of the structure is search

§  Schema can be provided at
the deployment time or at the
runtime (dynamic schema)
§  Schema can be used to
o  Validate data
o  Support user friendly search
o  Optimize storage and queries
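
The "validate data" use of a schema can be sketched as follows. This is a minimal illustration, not any particular database's mechanism; the `SCHEMA` mapping, its field names, and the `validate` helper are hypothetical.

```python
# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"id": int, "name": str, "email": str}


def validate(record, schema=SCHEMA):
    """Return a list of problems found in the record (empty list means valid)."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field in record:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors


good = {"id": 1, "name": "Ann", "email": "ann@example.com"}
bad = {"id": "1", "name": "Ann"}  # wrong type for id, email missing

print(validate(good))
print(validate(bad))
```

A store with a dynamic (runtime) schema would build a mapping like `SCHEMA` from the data it sees rather than from a declaration made at deployment time.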

Copyright by Marion Doss and licensed for reuse under CC License , http://www.flickr.com/photos/ooocha/2611398859/

Semi-structured Data
§  Structure is not fully defined.
But there is some inherent
structure.
§  For example
o  XML documents, where data is stored in a tree-like structure
o  Graph data
o  Data structures like lists and
arrays
§  Support queries based on
structure
§  But processing data often
needs custom code.

Copyright Walter Baxter http://www.geograph.org.uk/photo/1069339

Search
§  Unstructured Data: there is no structure to support search.
o  Search based on an inverted (reverse) index
o  Search through properties
§  Semi-Structured Data
o  XML (or any tree-like structure) can be searched with XPath or XQuery.
o  Tuple spaces can be queried through tuple space templates.
o  Data registries can be searched for entries that match given metadata descriptions (search by properties).
o  Graphs can be queried based on connectivity.
§  Structured Data
o  Retrieve by Key
o  WHERE clauses
o  Queries with JOINs
o  Offline Queries
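
Index-based search over unstructured text, mentioned in the first bullet above, can be sketched as a tiny inverted index: a map from each word to the documents that contain it. The document names and contents below are made up for illustration, and the tokenizer is deliberately naive.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Illustrative documents only.
docs = {
    "a.txt": "Relational databases support JOIN queries",
    "b.txt": "Column families support high write throughput",
    "c.txt": "Graph databases support relationship search",
}
index = build_inverted_index(docs)
print(sorted(search(index, "databases support")))
```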

Copyright digitalART2 and licensed for reuse under CC License , http://www.flickr.com/photos/digitalart/2101765353/

Consistency and Scalability
§  Scalability: the ability to handle more users, more data, or larger files by adding more nodes. We will use 3 categories.
o  Small systems (can be handled with 1-3 nodes)
o  Scalable systems (can be handled with about 10 nodes)
o  Highly scalable systems (anything larger, can be 100s or 1000s of nodes)
Copyright NNSANews and licensed for reuse under CC License , http://www.flickr.com/photos/nnsanews/5347287260/
§  Consistency: how replicas of the same data on many nodes are kept in sync, and how they can be updated without corrupting data. We will use 3 categories.
o  Transactional: a series of operations updated in an ACID manner
o  Atomic operation: a single operation, updated in all replicas
o  Eventual consistency: data will eventually become consistent
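
Eventual consistency can be illustrated with a minimal sketch, assuming a last-write-wins rule based on a version number supplied by the caller. The `Replica` class and the `anti_entropy` function are hypothetical names; real systems use more careful versioning (for example vector clocks) and their own synchronization protocols.

```python
class Replica:
    """One copy of the data; each key stores a (version, value) pair."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value, version):
        """Accept the write only if it is newer than what is already stored."""
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None


def anti_entropy(a, b):
    """Exchange entries both ways so both replicas end up with the newest version."""
    for src, dst in ((a, b), (b, a)):
        for key, (version, value) in list(src.store.items()):
            dst.write(key, value, version)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("cart", ["book"], version=1)
r2.write("cart", ["book", "pen"], version=2)  # newer write reaches only r2

stale = r1.read("cart")   # r1 still serves the older value
anti_entropy(r1, r2)      # background synchronization
fresh = r1.read("cart")   # both replicas now agree
print(stale, fresh)
```

Between the second write and the synchronization, a reader of `r1` sees stale data; that window is what an application accepts when it chooses loose consistency.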

Data Storage Implementations
§  Expectations of data stores
o  Reliably store the data
o  Efficient search and retrieval
of data whenever needed
o  Data management – delete,
update data
Copyright John Atherton by and licensed for reuse under CC
License , http://www.flickr.com/photos/gbaku/2231332836/

Challenges of Data Storage
§  Reliability
o  Replicating data
o  Creating backup or recovering using backups
§  Security
§  Scaling and Parallel access
o  Distribution or replications
o  ACID transactions
§  Availability
o  Data replications
§  Vendor lock-in
o  Interoperability, standard query languages
§  Simple user experience
o  Hide the physical location of data
o  Provide simple API and security models
o  Expressive query languages

Data Storage Choices

| Storage Type | Data type | Advantages | Disadvantages | Query: Key | Query: Where | Query: Joins | Transactions | Scale | Flexible schema |
|---|---|---|---|---|---|---|---|---|---|
| Local memory | Structured | Very fast | Not durable | Yes | No | No | No, unless STMs | No | Yes |
| Relational / SQL | Structured | Standardized | Rigid schema; good for read-oriented use cases | Yes | Yes | Yes | Yes | Moderate | No |
| Column families (NoSQL) | Structured | High write performance, replicated | Not transactional, no online joins | Yes | Yes, secondary index | No | No | High | Yes |
| Document DBs | Structured | High write performance, replicated | Not transactional, no online joins | Yes | Yes, views | No | No | Yes | Yes |
| Object Databases | Structured | Easy to integrate with programming languages | - | Yes | Yes | Yes | Yes | No | No |

| Storage Type | Data type | Advantages | Disadvantages | Query: Key | Query: Search | Transactions | Scale | Flexible schema |
|---|---|---|---|---|---|---|---|---|
| Files | Unstructured | Save big files whose format is not understood | No structured search on content | Yes | Indexing | No | Moderate | Yes |
| Data Registries / Metadata Catalogs | Unstructured | Metadata-based search | - | Yes | Property search (Where) | No | Moderate | Yes |
| Queues | Semi-structured | Representation of a flow of messages over time / tasks | - | Yes | N/A | No | Yes | Yes |
| Triple Stores | Semi-structured | Used for inference, very fast relationship processing | - | Yes | Relationship search | No | No | Yes |
| XML database | Semi-structured | XML native | - | - | XPath/XQuery | - | - | - |
| Distributed Cache | Semi-structured | Fast, replicated | No search | Yes | No | No | Yes | Yes |
| Key-value pairs | Semi-structured | High write performance, replicated | Model is too simple in some cases, not transactional | Yes | No | No | Yes | Yes |
| Graph DBs | Semi-structured | Very fast joins, natural to represent relationships | Not very scalable | Yes | Graph search | Yes | Low | N/A |

Choosing the Right Data Solution

How do we do this?

Copyright 8664 and licensed for reuse under CC License , http://www.flickr.com/photos/80464769@N00/186598462/

§  Consider structured, semi-structured, and unstructured data separately.
o  Then drill down based on the other 3 properties: scale, consistency, and search.
§  The structured case is more complicated; the other two are a bit simpler.
§  Start by giving a de facto choice for each case.

Handling Structured Data
§  There are three main considerations: scale, consistency, and queries

| Query type | Small (1-3 nodes): Loose Consistency | Small: Operation Consistency | Small: ACID Transactions | Scalable (10 nodes): Loose Consistency | Scalable: Operation Consistency | Scalable: ACID Transactions | Highly Scalable (1000s of nodes): Loose Consistency | Highly Scalable: Operation Consistency | Highly Scalable: ACID Transactions |
|---|---|---|---|---|---|---|---|---|---|
| Primary Key | DB/KV/CF | DB/KV/CF | DB | KV/CF | KV/CF | Partitioned DB? | KV/CF | KV/CF | No |
| Where | DB/CF/Doc | DB/CF/Doc | DB | CF/Doc (?) | CF/Doc (?) | Partitioned DB? | CF/Doc | CF/Doc | No |
| JOIN | DB | DB | DB | ?? | ?? | ?? | No | No | No |
| Offline | DB/CF/Doc | DB/CF/Doc | DB/CF/Doc | CF/Doc | CF/Doc | No | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
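
The matrix above can also be read as a lookup table. The sketch below encodes it as data, with cell values copied from the slide (including its "?" markers for uncertain cells). The identifier names (`SCALES`, `CONSISTENCY`, `recommend`) are hypothetical and exist only for this illustration.

```python
# Column order within each row: for each scale, the three consistency levels.
SCALES = ("small", "scalable", "highly_scalable")
CONSISTENCY = ("loose", "operation", "acid")

# Cell values as given in the matrix; "?" marks cells the slide leaves uncertain.
ROWS = {
    "primary_key": ["DB/KV/CF", "DB/KV/CF", "DB",
                    "KV/CF", "KV/CF", "Partitioned DB?",
                    "KV/CF", "KV/CF", "No"],
    "where": ["DB/CF/Doc", "DB/CF/Doc", "DB",
              "CF/Doc (?)", "CF/Doc (?)", "Partitioned DB?",
              "CF/Doc", "CF/Doc", "No"],
    "join": ["DB", "DB", "DB",
             "??", "??", "??",
             "No", "No", "No"],
    "offline": ["DB/CF/Doc", "DB/CF/Doc", "DB/CF/Doc",
                "CF/Doc", "CF/Doc", "No",
                "CF/Doc", "CF/Doc", "No"],
}


def recommend(query, scale, consistency):
    """Look up the cell for a query type, a scale category, and a consistency level."""
    column = SCALES.index(scale) * len(CONSISTENCY) + CONSISTENCY.index(consistency)
    return ROWS[query][column]


print(recommend("where", "highly_scalable", "loose"))
print(recommend("join", "small", "acid"))
```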

Handling Small Scale Systems (1-3 nodes)
§  In general, using a DB for every case here might work.
§  Reasons for using options other than a DB
o  When there is a potential need to scale later
o  High write throughput
§  KV is 1-D, whereas the other two (CF and Doc) are 2-D

Small (1-3 nodes)
| Query type | Loose Consistency | Operation Consistency | ACID Transactions |
|---|---|---|---|
| Primary Key | DB/KV/CF | DB/KV/CF | DB |
| Where | DB/CF/Doc | DB/CF/Doc | DB |
| JOIN | DB | DB | DB |
| Offline | DB/CF/Doc | DB/CF/Doc | DB/CF/Doc |

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems

Handling Scalable Systems
§  KV, CF, and Doc can easily handle this case.
§  If DBs are used with data sharded across many nodes
o  Transactions might work, given that there are not too many participants in one transaction.
o  JOINs might need to transfer too much data between nodes.
o  In-memory DBs like VoltDB should also be considered.
§  Offline mode will work.
§  Most systems let users choose the consistency level, and loose consistency can scale further (e.g. Cassandra).

Scalable (10 nodes)
| Query type | Loose Consistency | Operation Consistency | ACID Transactions |
|---|---|---|---|
| Primary Key | KV/CF | KV/CF | Partitioned DB? |
| Where | CF/Doc | CF/Doc | Partitioned DB? |
| JOIN | ?? | ?? | Partitioned DB?? |
| Offline | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems

Highly Scalable Systems

§  Transactions do not work at this scale (CAP theorem).
§  The same holds for JOINs. The problem is that sometimes too much data needs to be transferred between nodes to perform the JOIN.
§  The offline case is handled through Map-Reduce. Even the JOIN case is OK, since there is time.

Highly Scalable (1000s of nodes)
| Query type | Loose Consistency | Operation Consistency | ACID Transactions |
|---|---|---|---|
| Primary Key | KV/CF | KV/CF | No |
| Where | CF/Doc | CF/Doc | No |
| JOIN | No | No | No |
| Offline | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
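
An offline JOIN through Map-Reduce can be sketched as a reduce-side join: the map step tags each record with its source and emits it under the join key, the shuffle groups records by key, and the reduce step pairs them up. The sketch runs in a single process and only imitates the phases; the `users` and `orders` records and all function names are hypothetical, and a real job would run on a framework such as Hadoop.

```python
from collections import defaultdict


def map_phase(users, orders):
    """Emit (join key, tagged record) pairs from both inputs."""
    for user in users:
        yield user["user_id"], ("user", user)
    for order in orders:
        yield order["user_id"], ("order", order)


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every user record with every order record (inner join)."""
    for key, values in groups.items():
        matched_users = [rec for tag, rec in values if tag == "user"]
        matched_orders = [rec for tag, rec in values if tag == "order"]
        for user in matched_users:
            for order in matched_orders:
                yield {"user_id": key, "name": user["name"], "item": order["item"]}


# Illustrative data: user 2 has no orders, and one order refers to an unknown user 3.
users = [{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bob"}]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
    {"user_id": 3, "item": "ink"},
]

joined = list(reduce_phase(shuffle(map_phase(users, orders))))
for row in joined:
    print(row)
```

The shuffle step is where data moves between nodes in a real cluster, which is why this works offline but is usually too slow for an online query.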

Highly Scalable Systems + Primary Key Retrieval

§  This is (comparatively) the easy one.
§  It can be solved through DHT (Distributed Hash Table) based solutions or architectures like OceanStore.
§  Both Key-Value storage (KV) and Column Families (CF) can be used. But the Key-Value model is preferred as it is more scalable.

Highly Scalable (1000s of nodes)
| Query type | Loose Consistency | Operation Consistency | ACID Transactions |
|---|---|---|---|
| Primary Key | KV/CF | KV/CF | No |
| Where | CF/Doc (?) | CF/Doc (?) | No |
| JOIN | No | No | No |
| Offline | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
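
Key placement in DHT-style stores is commonly done with consistent hashing. The following is a minimal sketch of that technique, not the design of any specific system named above; the `HashRing` class, the node names, and the number of virtual nodes are assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring."""
    return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring where each node owns several virtual points."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point at or after the key's hash."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["n1", "n2", "n3"])
bigger = HashRing(["n1", "n2", "n3", "n4"])  # same ring after adding one node

keys = [f"user:{i}" for i in range(1000)]
moved = sum(ring.node_for(k) != bigger.node_for(k) for k in keys)
print(f"{moved} of {len(keys)} keys moved after adding n4")
```

Only keys that the new node takes over change owner; with plain `hash(key) % number_of_nodes` placement, most keys would move whenever a node is added.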

Highly Scalable Systems + WHERE

§  This is generally OK, but tricky.
§  CF works through a secondary index that does scatter-gather (e.g. Cassandra).
§  Doc works through Map-Reduce views (e.g. CouchDB).
§  There is Bissa, which builds an index for all possible queries (no range queries).
§  If you are doing this, you should do pilot runs and make sure things work.

Highly Scalable (1000s of nodes)
| Query type | Loose Consistency | Operation Consistency | Transactions |
|---|---|---|---|
| Primary Key | KV/CF | KV/CF | No |
| Where | CF/Doc (?) | CF/Doc (?) | No |
| JOIN | No | No | No |
| Offline | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: Document-based Systems
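
The scatter-gather idea behind a distributed secondary index can be sketched as follows: every node keeps an index over its own rows, and a WHERE query is sent to all nodes and the partial results are merged. This is an illustrative model only, not Cassandra's implementation; the `Node` and `Cluster` classes and the sample rows are hypothetical, and updates to an existing key are not handled.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Holds a partition of the rows plus a local index on (column, value)."""

    def __init__(self):
        self.rows = {}
        self.index = defaultdict(set)

    def put(self, key, row):
        # Simplification: assumes each key is written once (no index cleanup).
        self.rows[key] = row
        for column, value in row.items():
            self.index[(column, value)].add(key)

    def query(self, column, value):
        """Answer the WHERE clause from the local index only."""
        return [self.rows[k] for k in sorted(self.index.get((column, value), ()))]


class Cluster:
    def __init__(self, size):
        self.nodes = [Node() for _ in range(size)]

    def put(self, key, row):
        """Place the row on one node, chosen by hashing the primary key."""
        owner = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        self.nodes[owner].put(key, row)

    def where(self, column, value):
        """Scatter the query to every node, then gather and merge the results."""
        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            parts = list(pool.map(lambda node: node.query(column, value), self.nodes))
        return [row for part in parts for row in part]


cluster = Cluster(3)
for i, city in enumerate(["Colombo", "Kandy", "Colombo", "Galle", "Colombo", "Kandy"]):
    cluster.put(f"user{i}", {"name": f"user{i}", "city": city})

print(sorted(row["name"] for row in cluster.where("city", "Colombo")))
```

Because every query touches every node, latency follows the slowest node, which is one reason pilot runs are worth doing before relying on this pattern.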

Handling Unstructured Data

§  Storage Options
o  Distributed file systems: generally scalable (e.g. NFS), but HDFS (Hadoop) and Lustre are highly scalable versions.
o  Metadata registries (e.g. Nirvana, SDSC Storage Resource Broker)

Handling Semi-Structured Data

| Data | Small Scale (1-3 nodes) | Scalable (10 nodes) | Highly Scalable |
|---|---|---|---|
| XML (queried through XPath) | XML DB or convert to a structured model | XML DB or convert to a structured model | ?? |
| Graphs | Graph DBs | Graph DBs if the graph can be partitioned | ?? |
| Data Structures | Data Structure Servers, Object Databases | - | - |
| Queues | Distributed Queues | Distributed Queues | Distributed Queues |

§  Storage Options
o  Answer depends on the type of structure. If there is a server
optimized for a given type, it is often much more efficient than
using a DB. (e.g. Graph databases can support fast relationship
search)
§  Search
o  Very much custom, e.g. XPath for XML or any tree; graph databases can support very fast relationship search
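
Structure-based search over XML can be illustrated with Python's standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The small catalog document below is made up for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative document; the element and attribute names are invented.
doc = ET.fromstring("""
<catalog>
  <store type="column-family"><name>Cassandra</name></store>
  <store type="document"><name>CouchDB</name></store>
  <store type="column-family"><name>Bigtable</name></store>
</catalog>
""")

# XPath-style query: names of all stores whose type attribute is "column-family".
names = [e.text for e in doc.findall("./store[@type='column-family']/name")]
print(names)
```

The query follows the tree structure (catalog, then store, then name) and filters on an attribute, which is the kind of search a flat key lookup cannot express.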

Hybrid Approaches
§  Some solutions have many types of data and hence need more than one data solution (hybrid architectures).
§  For example
o  Using a DB for transactional data and CF for other data.
o  Keeping metadata and actual data separate for large data archives.
o  Using a GraphDB to store relationship data while other data is in Column Family storage.
Copyright by Matthew Oliphant and licensed for reuse under CC License , http://www.flickr.com/photos/fajalar/3174131216/
§  However, if transactions are needed, they have to be handled outside the storage (e.g. using Atomikos or ZooKeeper).
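
The hybrid example of keeping metadata and actual data separate can be sketched as follows. The file bytes go to a directory on disk, while a small in-memory dictionary stands in for a metadata registry and answers property searches. The `Archive` class, file names, and metadata fields are all hypothetical.

```python
import tempfile
from pathlib import Path


class Archive:
    """Stores file contents on disk and searchable metadata in a separate catalog."""

    def __init__(self, root):
        self.root = Path(root)
        self.catalog = {}  # file name -> metadata dict (stand-in for a registry)

    def put(self, name, data: bytes, **metadata):
        (self.root / name).write_bytes(data)
        self.catalog[name] = dict(metadata, size=len(data))

    def find(self, **criteria):
        """Search by properties without touching the stored files."""
        return sorted(
            name
            for name, meta in self.catalog.items()
            if all(meta.get(k) == v for k, v in criteria.items())
        )

    def get(self, name) -> bytes:
        return (self.root / name).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    archive = Archive(tmp)
    archive.put("run1.dat", b"\x00" * 1024, project="climate", year=2011)
    archive.put("run2.dat", b"\x01" * 2048, project="climate", year=2012)
    archive.put("scan.dat", b"\x02" * 512, project="genome", year=2011)

    matches = archive.find(project="climate")
    first = archive.get(matches[0])

print(matches, len(first))
```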

Other parameters
§  The above list is not exhaustive, and there are other parameters
o  Read/write ratio: when it is high, it is easy to scale
o  High write throughput
o  Very large data products: you will need a file system. Perhaps keep the metadata in a data registry and store the data in a file system.
o  Flexible schema
o  Archival use cases
o  Analytical use cases
o  Others …
§  So there is no silver bullet …

Conclusion
§  For the last 20 years or so, DBMSs were the de facto storage solution.
§  However, DBMSs could not scale well, and many NoSQL solutions have been proposed instead.
§  As a result, it is no longer easy to find the best data solution for your problem.
§  We discussed many dimensions (types of data, scalability, queries, and consistency) and provided guidelines on when to use which data solution.
§  Your feedback and thoughts are most welcome. Contact me through srinath@wso2.com
