The NoSQL movement has rekindled interest in data storage solutions. A few years ago, when systems were limited in scale, storage choices for programmers and architects were simple: relational databases were almost always the choice. However, the advent of the cloud and the ever-increasing user bases of applications have given rise to larger-scale systems. Relational databases cannot always scale to meet the needs of those systems, and as an alternative, the NoSQL movement has proposed many solutions.
A programmer who wants to select a data model now has to choose from a wide variety of options, such as local memory, relational databases, files, distributed caches, column family storage, document storage, name-value pairs, graph databases, service registries, queues, and tuple spaces. Furthermore, there are different access layers to choose from, such as accessing data directly, using an object-relational mapping layer like Hibernate/JPA, or using data services. Moreover, users also need to worry about how to scale the storage in multiple dimensions, such as the number of databases, the number of tables, the amount of data in a table, the frequency of requests, and the types of requests (read/write ratio).
Consequently, choosing the right data model for a given problem is no longer trivial, and such a choice needs a clear understanding of the different storage offerings, their similarities and differences, and the associated tradeoffs. We faced the same problem while designing the data interfaces for the Stratos Platform as a Service (PaaS) offering, and in this talk, we would like to share our findings and experiences from that work. We will present a survey of different data models, their differences and similarities, their tradeoffs, and the killer apps for each model. We believe the participants will walk away with a broader understanding of data models and with guidelines on which model to use when.
Finding the Right Data Solution for your Application in the Data Storage Haystack
1. Finding the Right Data Solution for Your Application in the Data Storage Haystack
Srinath Perera Ph.D.
Senior Software Architect, WSO2 Inc.
Visiting Faculty, University of Moratuwa
Research Scientist, Lanka Software Foundation
2. Data Models
§ Many data models have been proposed (read Stonebraker’s “What Goes Around Comes Around” for more details)
o Hierarchical (IMS): late 1960s and 1970s
o Directed graph (CODASYL): 1970s
o Relational: 1970s and early 1980s
o Entity-Relationship: 1970s
o Extended Relational: 1980s
o Semantic: late 1970s and 1980s
§ For the last 20-30 years, relational database systems (SQL) together with transactions have been the de facto data solution.
Copyright Greg Morss and licensed for reuse under CC License , http://www.geograph.org.uk/photo/990700
3. For many years, the choice of data storage was an easy one (use an RDBMS)
Copyright by Alan Murray Walsh and licensed for reuse under CC License , http://www.geograph.org.uk/photo/1652880
4. Scale of Systems
§ However, the scale of systems is changing due to
o Increasing user bases of systems
o Mobile devices, online presence
o Cloud computing and multicore systems
§ Scaling up an RDBMS
o Put it on a bigger machine
o Replicate (cluster) the database to 2-3 more nodes. However, this approach does not scale much beyond that.
o Partition the data across many nodes (distribute, a.k.a. sharding). However, JOIN queries across many nodes are hard, and sometimes too slow. This often needs custom code and configurations. Also, transactions do not scale as well.
Copyright digitalART2 and licensed for reuse under CC License , http://www.flickr.com/photos/digitalart/2101765353/
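The partitioning option on the slide above can be illustrated with a minimal sketch. It assumes simple hash partitioning; the node count, table contents, and the helper names `node_for` and `partition` are hypothetical and only serve the illustration. It shows why a JOIN becomes expensive once two tables are partitioned by different keys: most matching rows end up on different nodes.

```python
import hashlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Pick a node by hashing the partition key (stable across runs)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def partition(rows, key_field, num_nodes: int = NUM_NODES):
    """Distribute rows across nodes by the hash of one field."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(str(row[key_field]), num_nodes)].append(row)
    return nodes


users = [{"user_id": f"u{i}", "name": f"user-{i}"} for i in range(8)]
orders = [{"order_id": f"o{i}", "user_id": f"u{i % 8}"} for i in range(20)]

# Users are partitioned by user_id and orders by order_id, so the rows that
# match in a JOIN on user_id usually live on different nodes.
user_nodes = partition(users, "user_id")
order_nodes = partition(orders, "order_id")

# Count orders whose user row is stored on a different node.
remote = sum(
    1
    for node, rows in order_nodes.items()
    for row in rows
    if node_for(row["user_id"]) != node
)
print(f"{remote} of {len(orders)} orders need a cross-node lookup to join")
```

Partitioning both tables by `user_id` would co-locate matching rows, but that only helps JOINs on that one key, which is one reason this approach tends to need application-specific code.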
5. CAP Theorem, Transactions, and Storage
§ The RDBMS model provides two things
o Relational model with SQL
o ACID transactions – (Atomicity, Consistency, Isolation, Durability)
§ It was a classic one-size-fits-all solution, and it worked for quite some time.
§ However, the CAP theorem says that you cannot have it all.
o Consistency, Availability and Partition Tolerance, pick two!
§ But there are many use cases that do not need all RDBMS features; when those are dropped, systems can scale (e.g. Google Bigtable).
§ However, to use them, one has to understand and utilize the application-specific behavior.
Copyright stephcarter and licensed for reuse under CC License , http://www.flickr.com/photos/stephcarter/541464462
6. NoSQL and other Storage Systems
§ Large internet companies hit the problem first. They built systems that were specific to their problems, and those systems did scale.
o Google Bigtable
o Amazon Dynamo
§ Soon many others followed, and most of them are free and open source.
§ Now there are a couple of dozen.
§ Among the advantages of NoSQL are
o Scalability
o Flexible schema
o Designed to scale and support fault tolerance out of the box
Copyright ind{yeah} and licensed for reuse under CC License ,
http://www.flickr.com/photos/flickcoolpix/3566848458/
7. However, with NoSQL solutions, choosing a data store is no longer simple.
Copyright Philipp Salzgeber and licensed for reuse under CC License, http://www.salzgeber.at/astro/pics/20081126_heart/index.html
8. Selecting the Right Data Solution
§ What are the right Questions to ask?
§ Categorize Answers for each question
§ Take different cases based on different answers and make recommendations!
Copyright by Krzysztof Poltorak, and licensed for reuse under CC License.
http://www.fotocommunity.com/pc/pc/display/22077920
9. What are the right Questions?
o Types of data
- Structured, Semi-Structured, Unstructured
o Need for Scalability
- Number of users
- Number of data items
- Size of files
- Read/Write ratio
o Types of Queries
- Retrieve by Key
- WHERE clauses
- JOIN queries
- Offline Queries
o Consistency
- Loose Consistency
- Single Operation Consistency
- Transactions
Copyright by romainguy, and licensed for reuse under CC License http://www.flickr.com/photos/romainguy/249370084
10. Unstructured Data
§ Data does not have a particular structure and is often retrieved through a key (name).
o E.g. File systems.
§ Humans are good at processing unstructured data, but computers are not.
§ This data is often kept in storage but consumed by humans at the end of the pipeline (e.g. a document repository).
§ One common use case is building structured data from unstructured data.
§ Metadata is often associated with the data to help searching.
Copyright Martyn Gorman and licensed for reuse under CC License, http://www.geograph.org.uk/photo/294134
11. Structured Data
§ Has a structure, often described through a schema
§ Often a table-like 2D structure is used, but other structures are also possible.
§ The main advantage of the structure is search
§ The schema can be provided at deployment time or at runtime (dynamic schema)
§ The schema can be used to
o Validate data
o Support user-friendly search
o Optimize storage and queries
Copyright Marion Doss and licensed for reuse under CC License , http://www.flickr.com/photos/ooocha/2611398859/
12. Semi-structured Data
§ The structure is not fully defined, but there is some inherent structure.
§ For example
o XML documents, where data is stored in a tree-like structure
o Graph data
o Data structures like lists and arrays
§ Supports queries based on the structure
§ But processing the data often needs custom code.
Copyright Walter Baxter http://www.geograph.org.uk/photo/1069339
13. Search
§ Unstructured Data – no structure to support search.
o Search based on a reverse (inverted) index
o Search through properties
§ Semi-Structured Data
o XML (or any tree-like structure) can be searched with XPath or XQuery.
o Tuple spaces can be queried through tuple space templates
o Data registries can be searched for entries that match given metadata descriptions (search by properties)
o Graphs can be queried based on connectivity
§ Structured Data
o Retrieve by Key
o WHERE clauses
o Queries with JOINs
o Offline Queries
Copyright digitalART2 and licensed for reuse under CC License ,
http://www.flickr.com/photos/digitalart/2101765353/
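The reverse-index search mentioned for unstructured data can be sketched as follows. This is a minimal in-memory illustration, not how any particular search engine is implemented; the document names and the helpers `build_inverted_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document keys containing it."""
    index = defaultdict(set)
    for key, text in documents.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?")].add(key)
    return index


def search(index, query):
    """Return the keys of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


documents = {
    "notes.txt": "Column families scale well.",
    "report.txt": "Relational databases support JOIN queries.",
    "memo.txt": "Graph databases support relationship queries.",
}
index = build_inverted_index(documents)
print(sorted(search(index, "support queries")))  # ['memo.txt', 'report.txt']
```

The files themselves stay opaque to the store and are still retrieved by key; only the index knows anything about their content.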
14. Consistency and Scalability
§ Scalability – the ability to handle more users, more data, or larger files by adding more nodes. We will use 3 categories.
o Small systems (can be handled with 1-3 nodes)
o Scalable systems (can be handled with about 10 nodes)
o Highly scalable systems (anything larger; can be 100s or 1000s of nodes)
§ Consistency – how to keep the copies of the same data on many nodes (replicas) in sync, and how they can be updated without data corruption. We will use 3 categories.
o Transactional – a series of operations applied in an ACID manner
o Atomic operation – a single operation, applied to all replicas
o Eventual consistency – data will eventually become consistent
Copyright NNSANews and licensed for reuse under CC License , http://www.flickr.com/photos/nnsanews/5347287260/
16. Data Storage Implementations
§ Expectations of data stores
o Reliably store the data
o Efficient search and retrieval of data whenever needed
o Data management – delete and update data
Copyright John Atherton by and licensed for reuse under CC
License , http://www.flickr.com/photos/gbaku/2231332836/
17. Challenges of Data Storage
§ Reliability
o Replicating data
o Creating backup or recovering using backups
§ Security
§ Scaling and Parallel access
o Distribution or replications
o ACID transactions
§ Availability
o Data replications
§ Vendor lock-in
o Interoperability, standard query languages
§ Simple user experience
o Hide the physical location of data
o Provide simple APIs and security models
o Expressive query languages
18. Data Storage Choices

| Storage Type | Data Type | Advantages | Disadvantages | Query: Key | Query: Where | Query: Joins | Transactions | Scale | Flexible Schema |
|---|---|---|---|---|---|---|---|---|---|
| Local memory | | Very fast | Not durable | Yes | No | No | No, unless STMs | No | Yes |
| Relational / SQL | | Standardized | Rigid schema, good for read-oriented use cases | Yes | Yes | Yes | Yes | Moderate | No |
| Column families (NoSQL) | | High write performance, replicated | Not transactional, no online joins | Yes | Yes, secondary index | No | No | High | Yes |
| Document DBs | | High write performance, replicated | Not transactional, no online joins | Yes | Yes, views | No | No | Yes | Yes |
| Object Databases | Structured | Easy to integrate with programming languages | | Yes | Yes | Yes | Yes | No | No |
19.

| Storage Type | Data Type | Advantages | Disadvantages | Query: Key | Query: Search | Transactions | Scale | Flexible Schema |
|---|---|---|---|---|---|---|---|---|
| Files | | Save big files whose format is not understood | No structured search on content | Yes | Indexing | No | Moderate | Yes |
| Data Registries / Metadata Catalogs | Unstructured | Metadata based search | | Yes | Property search (Where) | No | Moderate | Yes |
| Queues | | Representation of the flow of messages over time / tasks | | Yes | N/A | No | Yes | Yes |
| Triple Stores | | Used for inference, very fast relationship processing | | Yes | Relationship search | No | No | Yes |
| XML database | | XML native | | | XPath/XQuery | | | |
| Distributed Cache | | Fast, replicated | No search | Yes | No | No | Yes | Yes |
| Key-value pairs | | High write performance, replicated | Model is too simple in some cases, not transactional | Yes | No | No | Yes | Yes |
| Graph DBs | Semi-structured | Very fast joins, natural to represent relationships | Not very scalable | Yes | Graph search | Yes | Low | N/A |
21. How do We do this?
§ Consider structured, semi-structured, and unstructured data separately.
o Then drill down based on the other 3 properties: scale, consistency, and search.
§ The structured case is more complicated; the other two are a bit simpler.
§ Start by giving a de facto choice for each case.
Copyright 8664 and licensed for reuse under CC License , http://www.flickr.com/photos/80464769@N00/186598462/
22. Handling Structured Data
§ There are three main considerations: scale, consistency, and queries

| Query | Small (1-3 nodes): Loose Consistency | Small: Operation Consistency | Small: ACID Transactions | Scalable (10 nodes): Loose Consistency | Scalable: Operation Consistency | Scalable: ACID Transactions | Highly Scalable (1000s of nodes): Loose Consistency | Highly Scalable: Operation Consistency | Highly Scalable: ACID Transactions |
|---|---|---|---|---|---|---|---|---|---|
| Primary Key | DB/KV/CF | DB/KV/CF | DB | KV/CF | KV/CF | Partitioned DB? | KV/CF | KV/CF | No |
| Where | DB/CF/Doc | DB/CF/Doc | DB | CF/Doc (?) | CF/Doc (?) | Partitioned DB? | CF/Doc | CF/Doc | No |
| JOIN | DB | DB | DB | ?? | ?? | ?? | No | No | No |
| Offline | DB/CF/Doc | DB/CF/Doc | DB/CF/Doc | CF/Doc | CF/Doc | No | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
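The table for structured data can also be read as a lookup keyed on scale, query type, and consistency. The sketch below encodes the cells of that table as they appear on the slide; "??" and "(?)" mark the cells the slide itself leaves open. The dictionary layout and the function name `recommend` are hypothetical choices for this illustration.

```python
# Order of the three consistency columns within each scale group.
CONSISTENCY = ("loose", "operation", "acid")

# (scale, query type) -> recommendations for (loose, operation, acid).
RECOMMENDATIONS = {
    ("small", "key"): ("DB/KV/CF", "DB/KV/CF", "DB"),
    ("small", "where"): ("DB/CF/Doc", "DB/CF/Doc", "DB"),
    ("small", "join"): ("DB", "DB", "DB"),
    ("small", "offline"): ("DB/CF/Doc", "DB/CF/Doc", "DB/CF/Doc"),
    ("scalable", "key"): ("KV/CF", "KV/CF", "Partitioned DB?"),
    ("scalable", "where"): ("CF/Doc (?)", "CF/Doc (?)", "Partitioned DB?"),
    ("scalable", "join"): ("??", "??", "??"),
    ("scalable", "offline"): ("CF/Doc", "CF/Doc", "No"),
    ("high", "key"): ("KV/CF", "KV/CF", "No"),
    ("high", "where"): ("CF/Doc", "CF/Doc", "No"),
    ("high", "join"): ("No", "No", "No"),
    ("high", "offline"): ("CF/Doc", "CF/Doc", "No"),
}


def recommend(scale: str, query: str, consistency: str) -> str:
    """Return the table cell for one combination of the three considerations."""
    row = RECOMMENDATIONS[(scale, query)]
    return row[CONSISTENCY.index(consistency)]


print(recommend("small", "join", "acid"))       # DB
print(recommend("high", "key", "loose"))        # KV/CF
print(recommend("scalable", "offline", "acid")) # No
```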
23. Handling Small Scale Systems (1-3 nodes)
§ In general, using a DB for every case here might work.
§ Reasons for using options other than a DB
o When there is a potential need to scale later.
o High write throughput
§ KV is 1-D, whereas the other two are 2-D

Small (1-3 nodes)
| Query | Loose Consistency | Operation Consistency | ACID Transactions |
|---|---|---|---|
| Primary Key | DB/KV/CF | DB/KV/CF | DB |
| Where | DB/CF/Doc | DB/CF/Doc | DB |
| JOIN | DB | DB | DB |
| Offline | DB/CF/Doc | DB/CF/Doc | DB/CF/Doc |

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
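One way to read the "1-D versus 2-D" remark is sketched below, under the assumption that it refers to how values are addressed: a key-value store maps a key to an opaque value, while column families and documents address a value by row key plus column (or field) name. The stores here are plain dictionaries and all names are hypothetical.

```python
import json

# Key-value (1-D): the key maps to an opaque value; the store cannot see inside it.
kv_store = {"user:42": json.dumps({"name": "Ann", "city": "Colombo"})}

# Column family / document (2-D): row key plus named columns or fields.
cf_store = {
    "user:42": {"name": "Ann", "city": "Colombo"},
    "user:43": {"name": "Bob", "city": "Kandy"},
}


def kv_get(key):
    """Only lookup by key is possible; the caller must decode the value."""
    return kv_store[key]


def cf_get(row_key, column):
    """Address a single cell by row key and column name."""
    return cf_store[row_key][column]


def cf_where(column, value):
    """The store can filter on a column because it understands the structure."""
    return sorted(k for k, cols in cf_store.items() if cols.get(column) == value)


print(json.loads(kv_get("user:42"))["city"])  # decoded by the application
print(cf_get("user:42", "city"))              # answered by the store
print(cf_where("city", "Kandy"))
```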
24. Handling Scalable Systems
§ KV, CF, and Doc can easily handle this case.
§ If DBs are used with data sharded across many nodes
o Transactions might work, given that there are not too many participants in one transaction.
o JOINs might need to transfer too much data between nodes.
o In-memory DBs like VoltDB should also be considered.
§ Offline mode will work.
§ Most systems let users choose the consistency level, and loose consistency can scale more (e.g. Cassandra).

Scalable (10 nodes)
| Query | Loose Consistency | Operation Consistency | ACID Transactions |
|---|---|---|---|
| Primary Key | KV/CF | KV/CF | Partitioned DB? |
| Where | CF/Doc | CF/Doc | Partitioned DB? |
| JOIN | ?? | ?? | Partitioned DB?? |
| Offline | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
25. Highly Scalable Systems
§ Transactions do not work at this scale (CAP theorem).
§ The same holds for JOINs. The problem is that sometimes too much data needs to be transferred between nodes to perform the JOIN.
§ The offline case is handled through Map-Reduce. Even the JOIN case is OK since there is time.

Highly Scalable (1000s of nodes)
| Query | Loose Consistency | Operation Consistency | ACID Transactions |
|---|---|---|---|
| Primary Key | KV/CF | KV/CF | No |
| Where | CF/Doc | CF/Doc | No |
| JOIN | No | No | No |
| Offline | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
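The offline JOIN through Map-Reduce can be sketched as a reduce-side join. This is a single-process illustration of the idea, not a Hadoop program: the map phase tags each record with its source and emits it under the join key, the shuffle groups records by key, and the reduce phase pairs them up. The record fields and function names are hypothetical.

```python
from collections import defaultdict


def map_phase(users, orders):
    """Emit (join_key, tagged_record) pairs from both inputs."""
    for user in users:
        yield user["user_id"], ("user", user)
    for order in orders:
        yield order["user_id"], ("order", order)


def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every user record with every order record (inner join)."""
    for key, values in groups.items():
        user_recs = [rec for tag, rec in values if tag == "user"]
        order_recs = [rec for tag, rec in values if tag == "order"]
        for user in user_recs:
            for order in order_recs:
                yield {"user_id": key, "name": user["name"], "order_id": order["order_id"]}


users = [{"user_id": "u1", "name": "Ann"}, {"user_id": "u2", "name": "Bob"}]
orders = [
    {"order_id": "o1", "user_id": "u1"},
    {"order_id": "o2", "user_id": "u1"},
    {"order_id": "o3", "user_id": "u3"},
]
joined = list(reduce_phase(shuffle(map_phase(users, orders))))
for row in joined:
    print(row)
```

In a real cluster the shuffle step is where data moves between nodes, which is acceptable offline but is the cost that makes online JOINs impractical at this scale.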
26. Highly Scalable Systems + Primary Key Retrieval
§ This is (comparatively) the easy one.
§ Can be solved through DHT (Distributed Hash Table) based solutions or architectures like OceanStore.
§ Both Key-Value storage (KV) and Column Families (CF) can be used. But the Key-Value model is preferred as it is more scalable.

Highly Scalable (1000s of nodes)
| Query | Loose Consistency | Operation Consistency | ACID Transactions |
|---|---|---|---|
| Primary Key | KV/CF | KV/CF | No |
| Where | CF/Doc (?) | CF/Doc (?) | No |
| JOIN | No | No | No |
| Offline | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
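Key-based retrieval over many nodes can be sketched with a consistent-hashing ring, which is one common building block of DHT-style systems (the slide does not say which DHT design is meant, so this choice is an assumption). The class names `HashRing` and `DistributedKV`, the node names, and the number of virtual nodes are hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hashing ring with virtual nodes."""

    def __init__(self, nodes, vnodes: int = 64):
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node owns several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point clockwise from the key."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]


class DistributedKV:
    """Key-value store that routes each key to one node's local dictionary."""

    def __init__(self, nodes):
        self._ring = HashRing(nodes)
        self._stores = {node: {} for node in nodes}

    def put(self, key, value):
        self._stores[self._ring.node_for(key)][key] = value

    def get(self, key):
        return self._stores[self._ring.node_for(key)].get(key)


kv = DistributedKV(["node-a", "node-b", "node-c"])
kv.put("user:42", "Ann")
print(kv.get("user:42"))  # Ann
```

Because every lookup touches exactly one node and removing a node only remaps the keys that node owned, this access pattern keeps working as the cluster grows.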
27. Highly Scalable Systems + WHERE
§ This is generally OK, but tricky.
§ CF works through a secondary index that does scatter-gather (e.g. Cassandra).
§ Doc works through Map-Reduce views (e.g. CouchDB).
§ There is Bissa, which builds an index for all possible queries (no range queries).
§ If you are doing this, you should do pilot runs and make sure things work.

Highly Scalable (1000s of nodes)
| Query | Loose Consistency | Operation Consistency | Transactions |
|---|---|---|---|
| Primary Key | KV/CF | KV/CF | No |
| Where | CF/Doc (?) | CF/Doc (?) | No |
| JOIN | No | No | No |
| Offline | CF/Doc | CF/Doc | No |

*KV: Key-Value Systems, CF: Column Families, Doc: document based Systems
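The scatter-gather behavior of a secondary index can be sketched as follows. This is an in-memory illustration under the assumption that each node indexes only its own rows; it is not Cassandra's implementation, and the classes `Node` and `Cluster` are hypothetical.

```python
import zlib
from collections import defaultdict


class Node:
    """One storage node with a local secondary index over its own rows."""

    def __init__(self):
        self.rows = {}
        # column name -> column value -> set of row keys stored on this node
        self.index = defaultdict(lambda: defaultdict(set))

    def put(self, key, columns):
        old = self.rows.get(key)
        if old:  # drop stale index entries when a row is overwritten
            for col, val in old.items():
                self.index[col][val].discard(key)
        self.rows[key] = dict(columns)
        for col, val in columns.items():
            self.index[col][val].add(key)

    def where(self, column, value):
        keys = self.index.get(column, {}).get(value, set())
        return [(key, self.rows[key]) for key in keys]


class Cluster:
    def __init__(self, num_nodes):
        self.nodes = [Node() for _ in range(num_nodes)]

    def put(self, key, columns):
        # Rows are placed by primary key only.
        node = self.nodes[zlib.crc32(key.encode("utf-8")) % len(self.nodes)]
        node.put(key, columns)

    def where(self, column, value):
        results = []
        for node in self.nodes:  # scatter: every node is asked
            results.extend(node.where(column, value))
        return sorted(results, key=lambda item: item[0])  # gather and merge


cluster = Cluster(num_nodes=3)
cluster.put("user:1", {"city": "Colombo"})
cluster.put("user:2", {"city": "Kandy"})
cluster.put("user:3", {"city": "Colombo"})
print([key for key, _ in cluster.where("city", "Colombo")])
```

Since every query on a non-key column contacts all nodes, its cost grows with cluster size, which is why pilot runs are advisable before relying on it.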
28. Handling Unstructured Data
§ Storage Options
o Distributed file systems - generally scalable (e.g. NFS), but HDFS (Hadoop) and Lustre are highly scalable versions.
o Metadata registries (e.g. Nirvana, SDSC Storage Resource Broker)
29. Handling Semi-Structured Data
| Type of structure | Small Scale (1-3 nodes) | Scalable (10 nodes) | Highly Scalable |
|---|---|---|---|
| XML (queried through XPath) | XML DB or convert to a structured model | XML DB or convert to a structured model | ?? |
| Graphs | Graph DBs | Graph DBs if graph can be partitioned | ?? |
| Data Structures | Data Structure Servers, Object Databases | | |
| Queues | Distributed Queues | Distributed Queues | Distributed Queues |

§ Storage Options
o The answer depends on the type of structure. If there is a server optimized for a given type, it is often much more efficient than using a DB (e.g. graph databases can support fast relationship search).
§ Search
o Very much custom. E.g. XML or any tree is searched with XPath; graphs can support very fast relationship search.
30. Hybrid Approaches
§ Some solutions have many types of data and hence need more than one data solution (hybrid architectures).
§ For example
o Using a DB for transactional data and CF for other data.
o Keeping metadata and actual data separate for large data archives.
o Using a graph DB to store relationship data while other data is in Column Family storage.
§ However, if transactions are needed, they have to be handled outside the storage (e.g. using Atomikos, ZooKeeper).
Copyright Matthew Oliphant and licensed for reuse under CC License , http://www.flickr.com/photos/fajalar/3174131216/
31. Other parameters
§ The above list is not exhaustive, and there are other parameters
o Read/write ratio – when it is high (read-heavy), it is easier to scale
o High write throughput
o Very large data products – you will need a file system. Maybe keep the metadata in a data registry and store the data in a file system.
o Flexible schema
o Archival use cases
o Analytical use cases
o Others …
§ So there is no silver bullet …
32. Conclusion
§ For the last 20 years or so, relational DBMSs were the de facto storage solution
§ However, DBMSs could not scale well, and many NoSQL solutions have been proposed instead
§ As a result, it is no longer easy to find the best data solution for your problem.
§ We discussed many dimensions (types of data, scalability, queries, and consistency) and provided guidelines on when to use which data solution.
§ Your feedback and thoughts are most welcome. Contact me through srinath@wso2.com