In modern Software Development and Software Architecture, selecting the right DataStore is one of the most challenging and important tasks. In this presentation, I summarize the major DataStores and the decision criteria for selecting the right DataStore for a given use case.
4. Why the Right DataStore Matters
One size does not fit all
Fulfil the Functional Requirements
Fulfil the Non-Functional Requirements
Avoid rewriting the Code due to DataStore change
Reduce migration cost
Enable faster development
Reduce operational cost
Reduce maintenance cost
5. CAP Theorem
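Distributed DataStore: Data is stored in more than one Node (using Sharding and Replication)
Consistency: All clients see the same view of data, even right after update or delete
Availability: All clients can always read and write
Partition-tolerance: The system continues to work as expected in case of network partition (communication loss or delay between nodes).
During a network partition, a Distributed DataStore has to give up either Consistency or Availability. This is why the DataStores on the following slides are labelled CP or AP, and a single-node DataStore is labelled CA.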
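The trade-off between Consistency and Availability during a partition can be illustrated with a toy model. This is only a sketch: the `TinyCluster` class, its two replicas, and the `mode` flag are hypothetical names made up for this example and do not correspond to any real replication protocol.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    data: dict = field(default_factory=dict)


class TinyCluster:
    """Two replicas; `mode` is "CP" or "AP" (toy model, not a real protocol)."""

    def __init__(self, mode: str):
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False

    def write(self, replica: Replica, key: str, value: str) -> bool:
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write: replicas stay consistent, availability is lost.
                return False
            # AP: accept the write locally, so the replicas may diverge.
            replica.data[key] = value
            return True
        # No partition: replicate synchronously to both nodes.
        replica.data[key] = value
        other.data[key] = value
        return True


cp = TinyCluster("CP")
ap = TinyCluster("AP")
for cluster in (cp, ap):
    cluster.write(cluster.a, "x", "1")  # replicated to both nodes
    cluster.partitioned = True          # the network link between a and b fails

cp_ok = cp.write(cp.a, "x", "2")  # rejected in CP mode
ap_ok = ap.write(ap.a, "x", "2")  # accepted in AP mode, replica b keeps the old value
```

In the CP cluster both replicas still agree after the partition, at the price of a rejected write. In the AP cluster the write succeeds, but the two replicas now return different values until they are reconciled.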
6. Relational Database (SQL)
Key Features:
• Based on E.F. Codd’s paper on the relational model (1970)
• Table based and relational
• Multi-table, multi-row ACID transactional guarantee
• Vertically scalable
• Referential integrity
• Normalization of Data
• Structured Query Language (SQL)
• Battle-tested
• Sharding is managed by the Application/Middleware
• CA
• Example: PostgreSQL, Oracle, MS-SQL, MySQL
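The multi-row ACID guarantee and referential integrity listed above can be sketched with Python's built-in `sqlite3` module. SQLite is used here only because it ships with Python; the `account` and `transfer` tables and the `move_money` helper are hypothetical names for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute(
    "CREATE TABLE transfer ("
    " id INTEGER PRIMARY KEY,"
    " from_id INTEGER NOT NULL REFERENCES account(id),"
    " to_id INTEGER NOT NULL REFERENCES account(id),"
    " amount INTEGER NOT NULL)"
)
conn.execute("INSERT INTO account (id, balance) VALUES (1, 100), (2, 0)")
conn.commit()


def move_money(from_id: int, to_id: int, amount: int) -> bool:
    """Apply three statements atomically: all of them succeed or none does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, to_id),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, from_id),
            )
            conn.execute(
                "INSERT INTO transfer (from_id, to_id, amount) VALUES (?, ?, ?)",
                (from_id, to_id, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok_valid = move_money(1, 2, 30)       # succeeds
ok_overdraw = move_money(1, 2, 500)   # violates the CHECK constraint, rolled back
ok_unknown = move_money(1, 99, 10)    # violates the foreign key, rolled back
balances = dict(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
transfer_count = conn.execute("SELECT COUNT(*) FROM transfer").fetchone()[0]
```

The two failed calls leave no partial update behind: the credit that was applied before the failing statement is undone together with the rest of the transaction.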
7. Relational Database
When to use:
• As OLTP Database with ACID transactional guarantee
• Structured Data
• Data and load fit on a vertically scaled server
• Data is relational
When not to use:
• As OLAP Database
• Semi-structured (e.g. JSON, XML) or unstructured Data
• Horizontal scalability is required
• Geographic Data distribution is required
• Data is extremely relational
8. Key-Value Store
Key Features:
• Data structure is Key-Value pair (HashTable)
• Value can be a wide range of data structures (objects)
• Horizontally scalable using shared-nothing sharding
• Sharding is managed by the Database
• Schemaless
• Data redundancy and duplication due to denormalization
• An in-memory Key-Value store can be used as a distributed Cache
• CP or AP
• Example: Redis, Memcached, RocksDB
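Shared-nothing sharding of Key-Value pairs can be sketched as follows. This is a minimal illustration assuming simple hash-modulo placement; the `ShardedKV` class is a hypothetical name, and real Key-Value stores typically use more elaborate schemes such as consistent hashing or fixed hash slots so that adding a shard does not move most keys.

```python
import hashlib


class ShardedKV:
    """Each shard is an independent dict: no state is shared between shards."""

    def __init__(self, shard_count: int):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key: str) -> dict:
        # A stable hash of the key decides which shard owns it.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def put(self, key: str, value) -> None:
        self._shard_for(key)[key] = value

    def get(self, key: str, default=None):
        return self._shard_for(key).get(key, default)


store = ShardedKV(shard_count=4)
for user_id in range(100):
    store.put(f"user:{user_id}", {"visits": user_id})

one_user = store.get("user:42")
shard_sizes = [len(shard) for shard in store.shards]
```

Every key lives in exactly one shard, so reads and writes touch a single node, which is what makes this layout horizontally scalable.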
9. Key-Value Store
When to use:
• As OLTP Database with no ACID transactional guarantee
• High throughput, low latency Read/Write
• Horizontal scalability with sharding managed by the Database
• Large Dataset
• In-Memory Key-Value Store:
• Improved database access performance
• CMS, Real-time systems
When not to use:
• Dataset is small
• As OLTP Database with ACID transactional guarantee
• As OLAP Database
• Data is extremely relational
10. Document Database
Key Features:
• Database to store semi-structured Data (e.g. JSON, XML)
• Schemaless
• Multi-document ACID transactional guarantee (in some products, e.g. MongoDB)
• Data redundancy and duplication due to denormalization
• Horizontally scalable
• Sharding is managed by the Database
• CP or AP
• Example: MongoDB, CouchDB
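The schemaless, denormalized document model can be illustrated with plain Python dictionaries. This is only a sketch: the `orders` collection and the `values_at` and `find` helpers are hypothetical and merely mimic the kind of nested-field query a Document Database offers.

```python
orders = [
    {
        "_id": "o1",
        "customer": {"name": "Ada", "city": "Berlin"},  # embedded, not a foreign key
        "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    },
    {
        "_id": "o2",
        "customer": {"name": "Lin", "city": "Munich"},
        "items": [{"sku": "A1", "qty": 5}],
        "gift_note": "Happy birthday",  # extra field: no schema change needed
    },
]


def values_at(doc, path):
    """Yield every value reachable via a dotted path, descending into lists."""
    nodes = [doc]
    for part in path.split("."):
        next_nodes = []
        for node in nodes:
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                if isinstance(candidate, dict) and part in candidate:
                    next_nodes.append(candidate[part])
        nodes = next_nodes
    for node in nodes:
        if isinstance(node, list):
            yield from node
        else:
            yield node


def find(collection, path, value):
    """Return the documents in which the dotted `path` contains `value`."""
    return [doc for doc in collection if value in values_at(doc, path)]


with_b7 = [doc["_id"] for doc in find(orders, "items.sku", "B7")]
in_munich = [doc["_id"] for doc in find(orders, "customer.city", "Munich")]
```

Because the customer and the items are embedded in each order, one read returns everything needed to display it, at the cost of duplicating customer data across orders.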
11. Document Database
When to use:
• As OLTP Database with limited ACID transactional guarantee
• Data is semi-structured with advanced query features
• Rapid application development
• Data is schemaless
• Horizontal scaling with sharding managed by the Database
• Documents give better performance than normalized tables because related data is stored together
When not to use:
• Data is structured
• As analytics Database (OLAP)
• Multi-table (collection) ACID transactional guarantee is needed
• Data is extremely relational
12. Wide Column Store
Key Features:
• Two-dimensional key-value store
• Column families are stored separately
• Schemaless
• Horizontally scalable with shared-nothing sharding
• Sharding is managed by the Database
• Data redundancy and duplication due to denormalization
• AP
• Low latency write operations
• Example: Cassandra, ScyllaDB, BigTable
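The "two-dimensional key-value" idea, with column families stored separately, can be sketched as nested dictionaries. The `WideColumnTable` class and the family names are hypothetical; real Wide Column Stores add on-disk layout, timestamps, and replication on top of this model.

```python
from collections import defaultdict


class WideColumnTable:
    """A value is addressed by (row key, column family, column name)."""

    def __init__(self, families):
        # One independent store per column family: family -> row key -> {column: value}
        self.families = {name: defaultdict(dict) for name in families}

    def put(self, row_key: str, family: str, column: str, value) -> None:
        self.families[family][row_key][column] = value

    def get_row(self, row_key: str, family: str) -> dict:
        # Reading one family never touches the data of the other families.
        return dict(self.families[family].get(row_key, {}))


table = WideColumnTable(["profile", "activity"])
table.put("user:1", "profile", "name", "Ada")
table.put("user:1", "profile", "city", "Berlin")
table.put("user:1", "activity", "2024-01-01", "login")
table.put("user:2", "profile", "name", "Lin")  # rows need not share the same columns

profile_1 = table.get_row("user:1", "profile")
activity_2 = table.get_row("user:2", "activity")
```

Each row can carry a different set of columns, which is the schemaless property from the list above.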
13. Distributed Wide Column Store
When to use:
• As OLTP Database with no ACID transactional guarantee
• Planet-scale database with a massive amount of writes/reads
• Near linear horizontal scalability with sharding managed by the Database
• Extremely large Dataset
• As OLAP Database with additional OLAP tools (e.g. Spark)
• Extremely low latency write/read
When not to use:
• As OLTP Database with ACID transactional guarantee
• Dataset is small
• Data is extremely relational
• Data is document-oriented (e.g. JSON)
14. Graph Database
Key Features:
• Uses Graph Data structure (nodes, edges, properties)
• Relationships are first-class citizens
• Optimal for highly connected Dataset
• Uses Graph Algorithms (e.g. Graph Traversal) for faster queries
• ACID transactional guarantee
• CP
• Example: Neo4j, Gremlin-compatible (Apache TinkerPop) databases
Source: https://neo4j.com
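Graph Traversal over highly connected data can be sketched with a breadth-first search on an adjacency list. The `follows` graph and the `within_hops` helper are hypothetical; a Graph Database would run this kind of traversal natively over its stored relationships.

```python
from collections import deque

# Directed edges: who follows whom (made-up sample data).
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}


def within_hops(graph: dict, start: str, max_hops: int) -> dict:
    """Breadth-first traversal: map each reachable node to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance


two_hops = within_hops(follows, "alice", 2)
```

In a relational schema the same question would need one self-join per hop, which is why the slides recommend a Graph Database when the data is extremely relational.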
15. Graph Database
When to use:
• Data is extremely relational
• Relationships in the Data are very important
• As OLTP Database with ACID transactional guarantee
• Schema is evolving
When not to use:
• As analytics Database (OLAP)
• Data is not relational (disconnected) or only weakly relational
• Data is document-oriented
• Data is simple Key-Value pairs
16. Distributed SQL
Key Features:
• Table based and relational
• Multi-table, multi-row ACID transactional guarantee within some constraints (e.g. in Availability zones)
• Horizontally scalable
• Sharding is managed by the Database
• Geographic Data Distribution
• Referential integrity
• Structured Query Language (SQL)
• CP with very high availability
• Example: Google Spanner, CockroachDB, YugabyteDB
17. Distributed SQL
When to use:
• As OLTP Database with ACID transactional guarantee
• Near linear horizontal scalability with sharding managed by the Database
• Consistency, Availability and Partition-tolerance within an SLA
• Geographic Data distribution is required
• Data is structured and relational
When not to use:
• Geographic Data distribution is not required
• Lower Database price is desired
• As OLAP Database
• Semi-structured (e.g. JSON, XML) or unstructured Data
• Data and load fit on a vertically scaled server
• Data is extremely relational
18. Search Engine
Key Features:
• Provides Full-text search using Inverted Index
• Supports stop words, synonyms, and auto-correction
• Data is structured or semi-structured
• Horizontally scalable
• Geo queries (location based search)
• CP
• Example: Apache Solr, Elasticsearch
Source: https://community.hitachivantara.com
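Full-text search with an Inverted Index and stop words can be sketched in a few lines. The sample documents, the tiny `STOP_WORDS` set, and the helper names are assumptions made for this illustration; real Search Engines add stemming, synonyms, relevance scoring, and much more.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "is", "of", "for", "and", "to"}


def tokenize(text: str) -> list:
    """Lower-case the text, split it into terms, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_index(docs: dict) -> dict:
    """Inverted Index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return the ids of documents that contain every non-stop-word query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "The quick brown fox",
    2: "A quick guide to the database",
    3: "Brown bears and the database",
}
index = build_index(docs)
quick_hits = search(index, "the quick")
both_hits = search(index, "brown database")
```

A query only looks up the terms it contains instead of scanning every document, which is what makes the Inverted Index fast for text search.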
19. Search Engine
When to use:
• Moderate to advanced full-text search is needed
• Horizontal scalability with sharding managed by the Database
• Geo search is needed
• Structured or semi-structured data (e.g. Log Data, JSON, XML)
When not to use:
• As operational Database (OLTP)
• As analytics Database (OLAP)
• Data is extremely relational
20. Object Storage
Key Features:
• Manages Data as objects
• Stores Data as well as Metadata (Unique ID, Security Info)
• Single Repository (Flat hierarchy)
• REST API for CRUD operations
• Used mainly for unstructured and semi-structured data
• Provides a globally unique identifier to access the data
• AP
• Data replication and data distribution at object-level granularity
• Example: Amazon S3, Azure Blob Storage, Google Cloud Storage
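The object model (data plus metadata, addressed by a unique ID in a flat namespace) can be sketched in memory. The `TinyObjectStore` class and its method names are hypothetical and are not the API of any cloud provider; real Object Storage exposes similar operations over REST.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    data: bytes
    metadata: dict


class TinyObjectStore:
    def __init__(self):
        self._objects = {}  # flat namespace: object id -> StoredObject, no directories

    def put(self, data: bytes, **metadata) -> str:
        """Store the bytes with their metadata and return a unique object id."""
        object_id = str(uuid.uuid4())
        meta = dict(metadata, size=len(data), sha256=hashlib.sha256(data).hexdigest())
        self._objects[object_id] = StoredObject(data, meta)
        return object_id

    def get(self, object_id: str) -> StoredObject:
        return self._objects[object_id]

    def delete(self, object_id: str) -> None:
        del self._objects[object_id]


store = TinyObjectStore()
object_id = store.put(b"hello world", content_type="text/plain", owner="team-a")
fetched = store.get(object_id)
```

Objects are written and read as a whole, which suits files such as images, videos, and backups but not row-level updates.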
21. Object Storage
When to use:
• To store unstructured and semi-structured data with Object level granularity (e.g. Streaming Videos, Images, CSV/XML files, Backups)
• High availability and high durability
• To reduce cost
• As Data lake
• Automatic Backup and Redundancy
When not to use:
• As operational Database (OLTP)
• As analytical Database (OLAP)
• Block Storage or a directory hierarchy is needed
• Structured data
22. Data Warehouse
Key Features:
• Large-scale Analytical (OLAP) Database
• Data abstraction is structured and relational
• Central repositories of all analytical data
• Columnar store
• Supports SQL
• Horizontally scalable
• AP
• Massively Parallel Processing (MPP)
• Efficient distributed query execution engine
• Petabyte, Exabyte scale dataset
• Example: Google BigQuery, Snowflake, Amazon Redshift
Source: https://datawarehouseinfo.com
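Why a Columnar store suits analytical queries can be sketched by laying out the same data per row and per column. The sample orders and the `total_by_country` helper are made up for this illustration; a real Data Warehouse adds compression, partitioning, and parallel execution on top of the columnar layout.

```python
# Row layout, as an OLTP database would store it: one record per order.
rows = [
    {"order_id": 1, "country": "DE", "amount": 120.0, "comment": "gift wrap"},
    {"order_id": 2, "country": "FR", "amount": 50.0, "comment": ""},
    {"order_id": 3, "country": "DE", "amount": 80.0, "comment": "express"},
]

# Columnar layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_by_country(cols: dict) -> dict:
    """Aggregate by reading only the two columns the query needs."""
    totals = {}
    for country, amount in zip(cols["country"], cols["amount"]):
        totals[country] = totals.get(country, 0.0) + amount
    return totals


revenue = total_by_country(columns)
```

The aggregation never reads `order_id` or `comment`. On a wide table with many columns this is the main saving of the columnar layout for OLAP queries.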
23. Data Warehouse
When to use:
• As Analytical Database with Complex analytical queries
• Extremely large Dataset (Petabyte)
• Limitless scaling
• Faster querying for large-scale database
• BI and advanced analytics are critical for the company
• Value added features like Machine Learning
When not to use:
• As operational Database (OLTP)
• Dataset is small
• Data Warehouse is overkill for the company due to price or nature of business
• OLTP Databases or Data Lakes already cover the Analytics needs
25. Future
One size fits many
Hybrid Transactional/Analytical Processing (HTAP)
Multi-Model Database
SQL and NoSQL will learn from each other
Multi-Cloud Database-as-a-Service (DBaaS)
Serverless Database
Specialized Hardware for different Databases
Many new Databases