Database Architecture & Scaling Strategies, in the Cloud & on the Rack

Watch the recording here: https://www.youtube.com/watch?v=ZwERp38ynxQ&feature=youtu.be

In this webinar, Robbie Mihalyi, VP of Engineering at Clustrix, explores how to set up a SQL RDBMS architecture that scales out and is both elastic and consistent, while simultaneously delivering fault tolerance and ACID compliance.

He also covers how data gets distributed in this architecture, how the query processor works, how rebalancing happens, and other architectural elements. Examples cited include cloud deployments and e-commerce use cases.

In this webinar, you will learn:

1. Five RDBMS scaling strategies, along with their trade-offs
2. The importance of having no single point of failure for OLTP (fault tolerance)
3. The vagaries of the cloud and how they impact running an RDBMS there

Who should watch?

1. People interested in high-performance, real-time database solutions
2. Companies that have MySQL in their infrastructure and are concerned that their growth will soon overwhelm MySQL’s single-box design
3. DBAs who implement ‘read slaves’, ‘multiple masters’ and ‘sharding’ for MySQL databases and want to learn about better ways to scale

Database Architecture & Scaling Strategies, in the Cloud & on the Rack

  1. 1. © 2015 CLUSTRIX Database Scaling Strategies, in the Cloud & on the Rack – Robbie Mihalyi @Clustrix
  2. 2. SQL SCALE-OUT ClustrixDB Overview2 Resiliency Capacity Elasticity Cloud
  3. 3. Cloud – Utility Computing (bare metal), Platform as a Service (PaaS), SaaS o Commoditized hardware resources - Rapid deployment and pay by the hour o Access - Publish your applications quickly - Use existing services from the provider o Capacity - Scale resources as you need them o Virtualized (shared) resources - You do not always get the performance envelope you ask for o Dedicated (hardware) resources - Available but expensive - Less flexible
  4. 4. E-Commerce Applications – Example of a Great Match for Cloud o Need for capacity varies by seasonality and specific events - Some events can generate 10x normal traffic & increased conversion rates o Sensitive to performance characteristics - Throughput and latency o Up-time is most crucial at the busiest time - Every minute of downtime can mean thousands of dollars in lost revenue
  5. 5. SQL SCALE-OUT ClustrixDB Overview5 Resiliency Capacity Elasticity
  6. 6. SQL SCALE-OUT Resiliency Capacity Elasticity – SCALE: Data, Users, Sessions; THROUGHPUT: Concurrency, Transactions; LATENCY: Response Time
  7. 7. Application Scaling (App Layer Only) – Easy Installation and Setup o Load balancer - HAProxy or equivalent - Distributes incoming requests o Scale out by adding servers - All servers are the same – no master o Redundant backend network - Low-latency cluster intercommunication (Diagram: a load balancer in front of identical commodity app servers)
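
For readers who want the app-layer idea in code: a minimal Python sketch of round-robin request distribution, the default policy a balancer such as HAProxy applies. The server names and the dispatcher function are illustrative assumptions, not part of the webinar or of HAProxy's API.

```python
from itertools import cycle

# Hypothetical pool of identical, stateless app servers sitting behind the balancer.
APP_SERVERS = ["app1:8080", "app2:8080", "app3:8080"]

def make_round_robin(servers):
    """Return a function that hands out the next server, round-robin."""
    pool = cycle(servers)
    return lambda: next(pool)

pick = make_round_robin(APP_SERVERS)
for request_id in range(6):
    # Each request goes to the next server in the rotation.
    print(f"request {request_id} -> {pick()}")
```
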
  8. 8. Application Scaling (Database Layer) Database Scaling Is Very Hard o Data Consistency o Read vs. Write Scale o ACID Properties (if you care about it) o Throughput and Latency o Application Impact ClustrixDB Overview8
  9. 9. Non-Relational (NoSQL) Database Architectures o No imposed structure o Relaxed or no ACID properties - BASE as an alternative to ACID o Fast and scalable o Suited for specific applications - IoT, click-stream, object store, document - Good for insert workloads - Not good for read/query apps o An RDBMS can also provide a fast non-structured data store
  10. 10. ClustrixDB Overview10 RDBMS SCALING
  11. 11. Scaling Up o Keep increasing the size of the (single) database server o Pros - Simple; no application changes needed o Cons - Expensive: at some point you’re paying 5x for 2x the performance - ‘Exotic’ hardware (128 cores and above) becomes price-prohibitive - Eventually you ‘hit the wall’ and literally cannot scale up any more
  12. 12. Scaling Reads: Master/Slave o Add one or more ‘slave’ read servers to your ‘master’ database server o Pros - Reasonably simple to implement - Read/write fan-out can be done at the proxy level o Cons - Only adds read performance - Data consistency issues can occur, especially if the application isn’t coded to ensure reads from the slave are consistent with reads from the master
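
A hedged sketch of what read/write fan-out at the proxy level can look like, assuming a simple rule: writes go to the master, and reads issued shortly after a write also go to the master to avoid hitting a lagging slave. The FakeServer class, the routing rule, and the one-second stickiness window are invented stand-ins, not Clustrix or MySQL APIs.

```python
import random
import time

class ReadWriteRouter:
    """Route writes to the master and reads to a replica.

    After a write, reads from this session stick to the master for a short
    window, since replicas may lag behind (the consistency issue on the slide).
    """
    def __init__(self, master, replicas, stickiness_s=1.0):
        self.master = master
        self.replicas = replicas
        self.stickiness_s = stickiness_s
        self.last_write = 0.0

    def execute(self, sql):
        is_write = sql.lstrip().split()[0].upper() in {"INSERT", "UPDATE", "DELETE", "REPLACE"}
        if is_write:
            self.last_write = time.monotonic()
            return self.master.run(sql)
        # Recent write: read from the master to avoid stale replica data.
        if time.monotonic() - self.last_write < self.stickiness_s:
            return self.master.run(sql)
        return random.choice(self.replicas).run(sql)

class FakeServer:
    """Stand-in for a real database connection (illustrative only)."""
    def __init__(self, name):
        self.name = name
    def run(self, sql):
        return f"{self.name} executed: {sql}"

router = ReadWriteRouter(FakeServer("master"), [FakeServer("replica1"), FakeServer("replica2")])
print(router.execute("INSERT INTO orders VALUES (1)"))
print(router.execute("SELECT * FROM orders"))   # recent write, still served by the master
time.sleep(1.2)
print(router.execute("SELECT * FROM orders"))   # past the window, served by a replica
```
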
  13. 13. Scaling Writes: Master/Master o Add additional ‘master’(s) to your ‘master’ database server o Pros - Adds write scaling without needing to shard o Cons - Adds write scaling at the cost of read slaves - Adding read slaves would add even more latency - Application changes are required to ensure data consistency / conflict resolution
  14. 14. Scaling Reads & Writes: Sharding o Partitioning tables across separate database servers (e.g. shards covering key ranges A–K, L–O, P–S, T–Z) o Pros - Adds both write and read scaling o Cons - Loses the RDBMS’s ability to manage transactionality, referential integrity and ACID - ACID compliance & transactionality must be managed at the application level - Consistent backups across all the shards are very hard to manage - Reads and writes can be skewed / unbalanced - Application changes can be significant
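
A minimal sketch of the range-based routing the slide's A–K / L–O / P–S / T–Z diagram implies, assuming the shard key is a customer name and the shard names are invented. It also hints at the main "con": anything spanning two ranges is no longer a single ACID transaction.

```python
# Hypothetical range-based shard map mirroring the slide's A-K / L-O / P-S / T-Z split.
SHARD_RANGES = [
    ("A", "K", "shard01"),
    ("L", "O", "shard02"),
    ("P", "S", "shard03"),
    ("T", "Z", "shard04"),
]

def shard_for(key: str) -> str:
    """Pick the shard whose letter range covers the key's first character."""
    first = key[:1].upper()
    for low, high, shard in SHARD_RANGES:
        if low <= first <= high:
            return shard
    raise ValueError(f"no shard covers key {key!r}")

# A transfer between customers on different shards is no longer one ACID
# transaction; the application has to coordinate it (and its backups) itself.
print(shard_for("Alice"), shard_for("Zoe"))
```
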
  15. 15. Scaling Reads & Writes: MySQL Cluster o Provides shared-nothing clustering and auto-sharding for MySQL (designed for telco deployments: minimal cross-node transactions, HA emphasis) o Pros - Distributed, multi-master model - Provides high availability and high throughput o Cons - Only supports read-committed isolation - Long-running transactions can block a node restart - SBR (statement-based) replication is not supported - Range scans are expensive and perform worse than in MySQL - Unclear how it scales with many nodes
  16. 16. Application Workload Partitioning o Partition the entire application + RDBMS stack across several “pods” o Pros - Adds both write and read scaling - Flexible: can keep scaling by adding pods o Cons - No data consistency across pods (only suited to cases where it is not needed) - High overhead in DBMS maintenance and upgrades - Queries / reports across all pods can be very complex - Complex environment to set up and support
  17. 17. SQL SCALE-OUT ClustrixDB Overview17 Resiliency Capacity Elasticity
  18. 18. SQL SCALE-OUT ClustrixDB Overview18 Resiliency Capacity Elasticity Ease of ADDING and REMOVING resources Flex Up or Down  Capacity On-Demand Adapt Resources to Price- Performance Requirements
  19. 19. Elasticity – flexing up and down. Scaling options and how they flex up / flex down: o Application (only) – flex up: easy; flex down: easy o NoSQL databases – flex up: easy; flex down: unclear if it is possible o Scale-up – flex up: expensive; flex down: not applicable o Master–Slave – flex up: reasonably simple; flex down: turn off read slaves o Master–Master – flex up: involved; flex down: involved o Sharding – flex up: expensive and complex; flex down: not feasible o MySQL Cluster – flex up: involved; flex down: involved o Application Partitioning – flex up: expensive and complex; flex down: expensive and complex
  20. 20. SQL SCALE-OUT ClustrixDB Overview20 Resiliency Resilience to Failures  Hardware or Software Fault Tolerance and High Availability Capacity Elasticity
  21. 21. Resiliency – high availability and fault tolerance. Scaling options and their resilience to failures: o Application (only) – no single point of failure; a failed node is bypassed o NoSQL databases – support exists o Scale-up – one large machine, so a single point of failure o Master–Slave – fail-over to the slave o Master–Master – resilient to one of the masters failing o Sharding – multiple points of failure o MySQL Cluster – no single point of failure o Application Partitioning – multiple points of failure
  22. 22. RDBMS Capacity, Elasticity and Resiliency o Scale-up – capacity: many cores, very expensive; elasticity: no; resiliency: single point of failure; application impact: none o Master–Slave – capacity: reads only; elasticity: no; resiliency: fail-over; application impact: yes, for read scale o Master–Master – capacity: read/write; elasticity: no; resiliency: yes; application impact: high (update conflicts) o MySQL Cluster – capacity: read/write; elasticity: no; resiliency: yes; application impact: none (or minor) o Sharding – capacity: unbalanced reads/writes; elasticity: no; resiliency: multiple points of failure; application impact: very high
  23. 23. ClustrixDB Overview23 CLUSTRIXDB  FULL ACID COMPLIANT RDBMS  MYSQL COMPATIBLE  ARCHITECTED FROM THE GROUND-UP TO ADDRESS: CAPACITY, ELASTICITY AND RESILIENCY.
  24. 24. ClustrixDB – Shared-Nothing Symmetric Architecture. Each node contains: o Database Engine - all nodes can perform all database operations (no leader, aggregator, leaf, data-only or other special nodes) o Query Compiler - distributes compiled partial query fragments to the node containing the ranking replica o Data (table slices) - all table slices are auto-redistributed by the Rebalancer (default: replicas = 2) o Data Map - all nodes know where all replicas are (Diagram: every ClustrixDB node holds a Compiler, Map, Engine and Data)
  25. 25. Intelligent Data Distribution o Tables are auto-split into slices o Every slice has a replica on another server - Auto-distributed and auto-protected (Diagram: billions of rows in database tables are split into slices S1–S5, each with a replica on a different ClustrixDB node)
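
To make "tables auto-split into slices, every slice has a replica on another server" concrete, here is a toy sketch assuming a hash-based split and a simple round-robin replica placement. The node names, slice count, and placement rule are assumptions for illustration, not ClustrixDB's actual algorithm.

```python
import hashlib

NODES = ["node1", "node2", "node3"]   # hypothetical 3-node cluster
SLICES_PER_TABLE = 6
REPLICAS = 2                          # each slice is kept on 2 distinct nodes

def slice_for(key: str) -> int:
    """Hash the row key to pick one of the table's slices."""
    h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
    return h % SLICES_PER_TABLE

def placement(slice_id: int) -> list[str]:
    """Place each replica of a slice on a different node (round-robin here)."""
    return [NODES[(slice_id + r) % len(NODES)] for r in range(REPLICAS)]

for key in ["order:1", "order:2", "order:42"]:
    s = slice_for(key)
    print(key, "-> slice", s, "on", placement(s))
```
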
  26. 26. Database Capacity and Elasticity o Easy and simple Flex Up (and Flex Down) - Flex multiple nodes at the same time o Data is automatically rebalanced across the cluster o All servers handle writes + reads o The application always sees a single database instance (Diagram: slices S1–S5 redistributed across ClustrixDB nodes as servers are added)
  27. 27. Built-in Fault Tolerance o No single point of failure - No data loss - No downtime o If a server node goes down… - Data is automatically rebalanced across the remaining nodes (Diagram: the failed node’s slices are re-protected on the remaining ClustrixDB nodes)
  28. 28. Distributed Query Processing o Queries are fielded by any peer node - Routed to the node holding the data o Complex queries are split into fragments processed in parallel - Automatically distributed for optimized performance (Diagram: a load balancer spreads transactions across all ClustrixDB nodes)
  29. 29. Replication and Disaster Recovery o Asynchronous multi-point replication o Parallel backup, up to 10x faster o Replicate to any cloud, any datacenter, anywhere
  30. 30. ClustrixDB Overview30 CLUSTRIXDB UNDER THE HOOD o DISTRIBUTION STRATEGY o REBALANCER TASKS o QUERY OPTIMIZER o EVALUATION MODEL o CONCURRENCY CONTROL
  31. 31. ClustrixDB key components enabling scale-out o Shared-nothing architecture - Eliminates potential bottlenecks o Independent index distribution - Each distribution key is hashed into a 64-bit number space divided into ranges, with a specific slice owning each range o Rebalancer - Ensures optimal data distribution across all nodes - Assigns slices to available nodes to balance data capacity and access o Query Optimizer - Distributed query planner, compiler, and distributed shared-nothing execution engine - Executes queries with maximum parallelism and many simultaneous queries concurrently o Evaluation Model - Parallelizes queries, which are distributed to the node(s) with the relevant data o Consistency and Concurrency Control - Uses Multi-Version Concurrency Control (MVCC) and Two-Phase Locking (2PL)
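
A small sketch of the index-distribution idea above: hash each distribution key into a 64-bit space that is pre-divided into ranges, each owned by one slice. The four equal ranges and the slice names are assumptions for illustration, not the real ClustrixDB layout.

```python
import bisect
import hashlib

def hash64(key: bytes) -> int:
    """Hash a distribution key into a 64-bit number space."""
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")

# Hypothetical split of the 2**64 space into 4 equal ranges, one slice owning each.
RANGE_STARTS = [i * (2**64 // 4) for i in range(4)]
RANGE_OWNERS = ["slice0", "slice1", "slice2", "slice3"]

def owning_slice(key: bytes) -> str:
    """Find the slice whose range contains the key's hash."""
    h = hash64(key)
    idx = bisect.bisect_right(RANGE_STARTS, h) - 1
    return RANGE_OWNERS[idx]

print(owning_slice(b"customer:1001"), owning_slice(b"customer:1002"))
```
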
  32. 32. Rebalancer Process o User tables are vertically partitioned into representations o Representations are horizontally partitioned into slices o The Rebalancer ensures that: - Each representation has an appropriate number of slices - Slices are well distributed around the cluster across storage devices - Slices are not placed on servers that are being flexed down - Reads from each representation are balanced across the nodes
  33. 33. ClustrixDB Rebalancer Tasks o Flex-UP - Redistribute replicas to the new nodes o Flex-DOWN - Move replicas from the flexed-down nodes to other nodes in the cluster o Under-protection – when a slice has fewer replicas than desired - Create a new copy of the slice on a different node o Slice too big - Split the slice into several new slices and redistribute them
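
A toy illustration of the "slice too big" task, assuming a byte-size threshold and a two-way split. The threshold, the metadata shape, and the placement rule are invented for the example; they are not the Rebalancer's actual policy.

```python
# Hypothetical slice metadata: (slice_id, size_in_bytes, node).
slices = [("t1_s0", 9_500_000_000, "node1"), ("t1_s1", 1_200_000_000, "node2")]

MAX_SLICE_BYTES = 8 * 1024**3   # illustrative 8 GiB threshold
NODES = ["node1", "node2", "node3"]

def split_oversized(slices):
    """Split any slice above the threshold into two halves and spread them out."""
    result = []
    for slice_id, size, node in slices:
        if size <= MAX_SLICE_BYTES:
            result.append((slice_id, size, node))
            continue
        # Split in two and place the halves on different nodes.
        for i in range(2):
            target = NODES[(NODES.index(node) + i) % len(NODES)]
            result.append((f"{slice_id}_{i}", size // 2, target))
    return result

for s in split_oversized(slices):
    print(s)
```
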
  34. 34. ClustrixDB Query Optimizer o The ClustrixDB Query Optimizer is modeled on the Cascades optimization framework - Other RDBMSs that leverage Cascades include Tandem’s NonStop SQL and Microsoft SQL Server - Cost-driven; extensible via a rule-based mechanism - Top-down approach o The Query Optimizer must answer the following for each SQL query: - In what order should the tables be joined? - Which indexes should be used? - Should the sort/aggregate be non-blocking?
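
Cascades itself is a full transformation-rule framework, but the cost-driven "in what order should the tables be joined?" question can be illustrated in a few lines: enumerate left-deep join orders and keep the cheapest by estimated intermediate result size. The table names, row counts, and flat selectivity below are made-up statistics, not ClustrixDB's cost model.

```python
from itertools import permutations

# Illustrative row counts and join selectivity; real optimizers use gathered statistics.
ROWS = {"orders": 1_000_000, "customers": 50_000, "items": 200}
SELECTIVITY = 0.001  # assumed fraction of row pairs surviving each join

def plan_cost(order):
    """Cost a left-deep join order as the sum of intermediate result sizes."""
    rows, cost = ROWS[order[0]], 0
    for table in order[1:]:
        rows = rows * ROWS[table] * SELECTIVITY
        cost += rows
    return cost

best = min(permutations(ROWS), key=plan_cost)
print("cheapest join order:", " -> ".join(best))
```
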
  35. 35. ClustrixDB Evaluation Model o Parallel query evaluation o Massively Parallel Processing (MPP) for analytic queries o The Fair Scheduler ensures OLTP is prioritized ahead of OLAP o Queries are broken into fragments (functions) o Joins by their nature require more data movement - ClustrixDB is able to achieve minimal data movement - Each representation (table or index) has its own distribution map, allowing direct look-ups for which node/slice to go to next and removing broadcasts - There is no central node orchestrating data motion; data moves directly to the next node it needs to go to, reducing hops to the minimum possible given the data distribution (Diagram: SELECT id, amount FROM donation WHERE id = 15 is compiled into fragments – fragment 1 looks up the node owning id = 15 and forwards, fragment 2 runs there and returns id and amount)
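
A sketch of the two-fragment flow shown on the slide, under the assumption of a toy id-range distribution map and an in-memory "donation" table per node. The map, the data, and the function names are illustrative, not ClustrixDB internals.

```python
# Hypothetical cluster: each node holds a slice of the "donation" table,
# and every node has a copy of the distribution map (id range start -> node).
DISTRIBUTION_MAP = [(0, "node1"), (10, "node2"), (20, "node3")]

DATA = {
    "node1": {5: 100},
    "node2": {15: 250},
    "node3": {25: 75},
}

def owner_of(row_id):
    """Fragment 1: look up which node owns the row; no central coordinator needed."""
    owner = DISTRIBUTION_MAP[0][1]
    for start, node in DISTRIBUTION_MAP:
        if row_id >= start:
            owner = node
    return owner

def read_fragment(node, row_id):
    """Fragment 2: runs on the owning node and returns (id, amount)."""
    return row_id, DATA[node][row_id]

# SELECT id, amount FROM donation WHERE id = 15
node = owner_of(15)               # any peer node can run this lookup
print(read_fragment(node, 15))    # forwarded to node2 and executed there
```
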
  36. 36. Concurrency Control o Readers never interfere with writers (or vice versa); writers use explicit locking for updates o MVCC maintains a version of each row as writers modify rows o Readers get lock-free snapshot isolation while writers use 2PL to manage conflicts o Lock conflict matrix: reader–reader none, reader–writer none, writer–reader none, writer–writer row lock (Diagram: over time, two writers touching the same row conflict and one is blocked; readers and writers never block each other)
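
A toy model of the reader/writer behaviour described above: readers get a lock-free snapshot of committed versions, while writers serialize on a per-row lock (standing in for 2PL) and append a new version. The class and its fields are illustrative only, not ClustrixDB code.

```python
import threading

class MVCCRow:
    """Toy MVCC row: readers see the latest committed version without locking,
    writers take a per-row lock (writer/writer conflict only) and append a version."""

    def __init__(self, value):
        self.versions = [(0, value)]      # list of (commit_id, value)
        self.lock = threading.Lock()      # only writers contend on this
        self.next_commit = 1

    def read(self, snapshot_id=None):
        # Lock-free read: return the newest version visible to the snapshot.
        visible = [v for cid, v in self.versions if snapshot_id is None or cid <= snapshot_id]
        return visible[-1]

    def write(self, value):
        with self.lock:                   # a second concurrent writer blocks here
            cid = self.next_commit
            self.next_commit += 1
            self.versions.append((cid, value))
            return cid

row = MVCCRow("initial")
snap = 0
row.write("updated")               # writer installs version 1
print(row.read(snapshot_id=snap))  # reader on the old snapshot still sees "initial"
print(row.read())                  # a new reader sees "updated"
```
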
  37. 37. ClustrixDB Overview37 CLUSTRIXDB DEPLOYMENT EXAMPLES
  38. 38. Example: Huge Write Workload (AWS Deployment) ClustrixDB Overview38 The Application Inserts 254 million / day Updates 1.35 million / day Reads 252.3 million / day Deletes 7,800 / day The Database Queries 5-9k per sec CPU Load 45-65% Nodes - Cores 10 nodes - 80 cores
  39. 39. Example: Huge Update Workload (Bare-Metal Deployment) ClustrixDB Overview39 The Application Inserts 31.4 million / day Updates 3.7 billion / day Reads 1 billion / day Deletes 4,300 / day The Database Queries 35-55k per sec CPU Load 25-35% Nodes - Cores 6 nodes - 120 cores
  40. 40. ClustrixDB Overview40 CLUSTRIXDB IN DEVELOPMENT
  41. 41. Next Release o Additional performance improvements - Further improvements to read and write scaling o Deployment and provisioning optimization - Cloud templates and deployment scripts - Instance testing and validation o New admin architecture and a much-improved Web UI - Services-based architecture with a (RESTful) API - Simplified single-click FLEX management - Significant graphing and reporting improvements - Multi-cluster topology view and management
  42. 42. New Web UI – Enhanced Dashboard ClustrixDB Overview42 482 tps
  43. 43. New Web UI – Historical Workload Comparison ClustrixDB Overview43
  44. 44. New Web UI – FLEX Administration ClustrixDB Overview44
  45. 45. ClustrixDB Overview45 FINAL THOUGHTS
  46. 46. ClustrixDB o Capacity – massive read/write scalability, very high concurrency, linear throughput scaling o Elasticity – Flex UP in minutes, Flex DOWN easily, right-size resources on demand o Resiliency – automatic, 100% fault tolerance, no single point of failure, battle-tested performance o Flexible Deployment – cloud, VM or bare metal; virtual images available; point-and-click scale-out
  47. 47. Thank You. facebook.com/clustrix www.clustrix.com @clustrix linkedin.com/clustrix ClustrixDB Overview47
  48. 48. Competitive Cluster Solutions o Most MySQL clustering solutions leverage Master/Master via replication: - MySQL Cluster - Galera (open-source library) - Percona XtraDB Cluster (leverages the Galera replication library) - Tungsten o ClustrixDB does NOT use replication to keep all the servers in sync - Replication cannot scale writes as well as our own technology - Replication has inherent potential consistency and latency issues - Transactional workloads such as OLTP (e.g. e-commerce) are exactly the workloads that replication struggles with the most
  49. 49. MySQL Cluster o Provides shared-nothing clustering and auto-sharding for MySQL (designed for telco deployments: minimal cross-node transactions, HA emphasis) o Pros: - Distributed, multi-master with no SPOF - Designed to provide high availability and high throughput with low latency, while allowing for near-linear scalability - Synchronous replication, two-phase commit o Cons: - Global checkpoint is 2 sec; “there are no guaranteed durable COMMITs to disk” - Only supports read-committed isolation - “MySQL Cluster does not handle large transactions well” - Long-running transactions can block a node restart - Overflow of data in the replication stream drops the node from the cluster, with consistency loss - ‘True’ HA requires multiple replication lines; “1 is not sufficient” for HA - DELETEs release memory only for the same table; a full release requires a rolling cluster restart - Range scans are expensive and perform worse than in MySQL - No distributed table locks
  50. 50. Galera Cluster o A multi-master topology using its own replication protocol (designed primarily for high availability, and secondarily for scale) o Pros: - Writes to any master are replicated to the other master(s) synchronously, ensuring all masters have the same data - Open source; 24/7 support can be purchased for $7,950/yr/server, and Percona also provides support at a higher price o Cons: - Write scale is limited: Galera support recommends that writes go to one master rather than be distributed across the nodes, which helps with isolation issues but increases consistency and latency issues across the nodes - Snapshot isolation does NOT use first-committer-wins (and so fails Aphyr’s Jepsen CAP tests); ClustrixDB does use first-committer-wins for snapshot consistency - Writesets are processed as a single memory-resident buffer, so extremely large transactions (e.g. LOAD DATA) may adversely affect node performance - Locking is lax with DDL: e.g. if your DML transaction uses a table and a parallel DDL statement is started, Galera won’t wait for a metadata lock, causing potential consistency issues
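
Since the slide contrasts Galera with first-committer-wins snapshot isolation, here is a minimal sketch of that rule: at commit time a transaction aborts if any row it wrote was committed by another transaction after its snapshot. The data structures and method names are invented for illustration; this is not Clustrix or Galera code.

```python
class FirstCommitterWins:
    """Toy commit-time validation: a transaction may commit only if no other
    transaction has committed a write to the same rows since its snapshot."""

    def __init__(self):
        self.commit_counter = 0
        self.last_commit_for_row = {}   # row key -> commit number

    def begin(self):
        return {"snapshot": self.commit_counter, "writes": set()}

    def write(self, txn, row_key):
        txn["writes"].add(row_key)

    def commit(self, txn):
        # First committer wins: abort if any written row changed after our snapshot.
        for row in txn["writes"]:
            if self.last_commit_for_row.get(row, 0) > txn["snapshot"]:
                return False
        self.commit_counter += 1
        for row in txn["writes"]:
            self.last_commit_for_row[row] = self.commit_counter
        return True

db = FirstCommitterWins()
t1, t2 = db.begin(), db.begin()
db.write(t1, "row:1"); db.write(t2, "row:1")
print(db.commit(t1))   # True  -- the first committer wins
print(db.commit(t2))   # False -- concurrent update to the same row, must retry
```
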
  51. 51. Percona XtraDB Cluster o An active/active high-availability and high-scalability open-source solution for MySQL clustering; it integrates Percona Server and Percona XtraBackup with the Galera replication library o Pros: - Synchronous replication - Multi-master replication support - Parallel replication - Automatic node provisioning o Cons: - Not designed for write scaling - SELECT FOR UPDATE can easily create deadlocks - Not true synchronous replication but ‘virtually synchronous’: data is committed on the originating node and an ack is sent to the application, while the other nodes commit asynchronously, which can lead to consistency issues for applications reading from those nodes - “If multiple nodes are used, the ability to read your own writes is not guaranteed. In that case, a certified transaction, which is already committed on the originating node, can still sit in the receive queue of the node the application is reading from, waiting to be applied.”
  52. 52. Tungsten Replicator o An open-source replication engine compatible with MySQL, Oracle and Amazon RDS; NoSQL stores such as MongoDB; and data-warehouse stores such as Vertica, InfiniDB and Hadoop o Pros: - Allows data to be exchanged between different databases and different database versions - During replication, information can be filtered and modified, and deployment can be between on-premise or cloud-based databases - For performance, Tungsten Replicator includes support for parallel replication and advanced topologies such as fan-in, star and multi-master, and can be used efficiently in cross-site deployments o Cons: - Very complicated to set up and maintain - No automated management, automated failover, transparent connections, or built-in conflict resolution - Only allows asynchronous replication - Cannot suppress slave-side triggers; each trigger must be altered to add an IF statement that prevents it from running on the slave
