Building Data Pipelines with SMACK:
Storage Strategies for Scale & Performance
June 8, 2016
Jonathan Shook, Solution Architect, DataStax
Spark
Mesos
Akka
Cassandra
Kafka
1 Essential Storage Concepts
2 Design Strategies
3 Storage Selection
4 Q & A
© DataStax, All Rights Reserved.
Essential Storage Concepts
The Basics
Important Terms
• Topology
• Bandwidth, Throughput, Headroom
• Latency, Minimum Latency
• Concurrency, Parallelism, Contention
Basic System Topology
Every modern system is
essentially a network of
components.
The language of message
delivery applies at every
level of design.
System Topology Example (high level)
[Figure: system topology diagram; device labels: HDD, SSD]
Term: Bandwidth, Throughput, Headroom
• Bandwidth - Maximum rated transfer speed of a device
• Throughput - Measurement of achievable transfer speed
• Headroom - Safety margin above normal usage (“reserve capacity”)
Throughput Example: SATA3
Using a popular SSD and an online benchmark...
Bandwidth:  6 Gb/s line rate (750 MB/s raw; about 600 MB/s usable after 8b/10b encoding)
Throughput: 40 MB/s to 500 MB/s as tested, depending on operation type
Headroom:   30%, for example. This is a design parameter.
In this case, if you can achieve 200 MB/s of throughput on the drive
for your operational patterns, a headroom of 30% means you
should be scaling out before your metrics show 140 MB/s.
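The headroom arithmetic above can be sketched in a few lines. This is only an illustration: the function name `scale_out_threshold` is hypothetical, and the 200 MB/s and 30% inputs are the example figures from this slide, not recommendations.

```python
def scale_out_threshold(achievable_mb_s: float, headroom: float) -> float:
    """Return the sustained throughput (MB/s) at which scale-out should begin.

    achievable_mb_s: throughput measured for your own operational patterns.
    headroom: reserve capacity as a fraction, e.g. 0.30 for 30%.
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be a fraction in [0, 1)")
    return achievable_mb_s * (1.0 - headroom)


# Example figures from the slide: 200 MB/s achievable, 30% headroom.
threshold = scale_out_threshold(200.0, 0.30)
print(f"Scale out before sustained throughput reaches {threshold:.0f} MB/s")
```

The important input is the achievable throughput for your workload, not the rated bandwidth of the interface.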
Term: Latency and Minimum Latency
• Latency - How long it takes to receive a response, once a
request is submitted
• Minimum Latency - Latency which is possible on a single
node when there is no resource contention
Single Node:
• However fast that node can service the request, uncontended.
Replica Set of 3 Nodes and LOCAL_QUORUM:
• Writes: The fastest 2 of 3 nodes in the replica set to respond.
• Reads: Usually the fastest 2 of 3, based on latency trends.
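The "fastest 2 of 3" idea can be sketched as follows. This is a simplified model, not Cassandra's implementation: it assumes the coordinator waits for a fixed number of acknowledgements and ignores coordinator overhead and retries. The function names and the latency values are hypothetical.

```python
def quorum_size(replication_factor: int) -> int:
    """Quorum is a majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def quorum_latency_ms(replica_latencies_ms, required_acks):
    """Return the time at which the required_acks-th fastest replica has replied."""
    if not 1 <= required_acks <= len(replica_latencies_ms):
        raise ValueError("required_acks must be between 1 and the replica count")
    return sorted(replica_latencies_ms)[required_acks - 1]


acks = quorum_size(3)  # 2 of 3 replicas

# One slow replica is hidden: the request completes with the second-fastest reply.
one_slow = [1.2, 0.9, 45.0]
print(quorum_latency_ms(one_slow, acks))  # 1.2

# Two slow replicas are not hidden: the request waits for one of them.
two_slow = [1.2, 50.0, 45.0]
print(quorum_latency_ms(two_slow, acks))  # 45.0
```

A quorum masks a single outlier node, but contention that affects two of the three replicas shows up directly in request latency.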
Latency and Throughput Example:
Random reads at different block sizes
SATA HDD has an unavoidable seek time penalty for all op sizes.
Throughput tops out at 180 MB/s at 16 MB read sizes, with over 1.5
seconds of latency.
SATA SSD performs well. 550 MB/s is possible, but desirable latencies
are found below 1 MB read size.
The NVMe drive can push two CDs' worth of data per second at 128 KB
read sizes. At 16 MB, latency is only 0.25 seconds.
Latency and Throughput Example:
Compared by Drive Type
This shows the same measurements compared between drive types.
Latency & Throughput Example:
Comparative Numbers
1 block read (512 bytes)
Drive       KB/s        Latency (µs)   IOPS
NVMe        62006       177            124013
SATA SSD    38700       306            77400
SATA HDD    215         119000         430

256 block read (128 KB)
Drive       KB/s        Latency (µs)   IOPS
NVMe        1707520     1160           13339
SATA SSD    549133      2320           4290
SATA HDD    41198       157000         321

32K block read (16 MB)
Drive       KB/s        Latency (µs)   IOPS
NVMe        1339596.8   235000         81
SATA SSD    554920      594000         33
SATA HDD    179063      1647000        10
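The throughput and IOPS columns above are two views of the same measurement: IOPS is throughput divided by read size. A small sketch of that relationship, using the figures from the tables. It assumes "KB" means 1024 bytes, which the slides do not state; the names `MEASUREMENTS` and `iops_from_throughput` are made up for this example.

```python
KIB = 1024  # assumption: the slide's "KB" is 1024 bytes

# (drive, read size in bytes, KB/s, reported IOPS), copied from the tables above
MEASUREMENTS = [
    ("NVMe", 512, 62006, 124013),
    ("SATA SSD", 512, 38700, 77400),
    ("SATA HDD", 512, 215, 430),
    ("NVMe", 128 * KIB, 1707520, 13339),
    ("SATA SSD", 128 * KIB, 549133, 4290),
    ("SATA HDD", 128 * KIB, 41198, 321),
    ("NVMe", 16 * KIB * KIB, 1339596.8, 81),
    ("SATA SSD", 16 * KIB * KIB, 554920, 33),
    ("SATA HDD", 16 * KIB * KIB, 179063, 10),
]


def iops_from_throughput(kb_per_s: float, read_bytes: int) -> float:
    """Operations per second implied by a throughput at a fixed read size."""
    return kb_per_s * KIB / read_bytes


for drive, read_bytes, kb_per_s, reported in MEASUREMENTS:
    derived = iops_from_throughput(kb_per_s, read_bytes)
    print(f"{drive:9s} {read_bytes:>9d} B  derived {derived:10.1f}  reported {reported}")
```

Under that assumption the derived values agree with the reported IOPS to within rounding, which is a quick sanity check to run on any benchmark output before comparing drives.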
Term: Concurrency, Parallelism, Contention
• Concurrency - Multiple requests in flight
• Parallelism - Simultaneous processing of requests
• Resource Contention - When work is blocked awaiting
access to a shared resource
Concurrency without parallelism causes resource contention,
queueing, latency increases, and unhappy users.
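A toy model makes the queueing effect concrete. The sketch below assumes every request arrives at the same instant, each takes the same fixed service time, and requests are served first-come, first-served. Real devices are messier; the function name and the numbers are illustrative only.

```python
import math


def completion_times_ms(num_requests: int, parallel_slots: int, service_time_ms: float):
    """Completion time of each request when all arrive at once.

    parallel_slots is how many requests the resource can process simultaneously.
    """
    if num_requests < 0 or parallel_slots < 1:
        raise ValueError("need num_requests >= 0 and parallel_slots >= 1")
    return [
        math.ceil((i + 1) / parallel_slots) * service_time_ms
        for i in range(num_requests)
    ]


# 8 requests in flight, 5 ms each, against one, four, and eight parallel slots.
for slots in (1, 4, 8):
    times = completion_times_ms(8, slots, 5)
    print(f"slots={slots}: worst={max(times)} ms, mean={sum(times) / len(times)} ms")
```

With one slot the last request waits 40 ms for 5 ms of work; with eight slots every request finishes in 5 ms. The concurrency is identical in both cases; only the available parallelism changed.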
(Storage) Design Strategies
Core Strategies for Going Fast and Staying Fast
Key Design Strategies
1. Design to the Workload
2. Simplify the Storage Path
3. Maintain Headroom
4. Balance Compute and I/O
5. Balance I/O Caching
Strategy #1: Design to the Workload
• Estimate your workloads.
Focus on the read patterns.
• Can your users endure effects
of resource contention?
• Can they endure disruptive
outliers?
• How do you know?
Strategy #2: Simplify the Storage Path
• Avoid unnecessary hardware layers. Go directly from your
system chipset to the drive when possible.
• Favor JBOD over storage aggregation.
• Only use RAID for:
– Datacenter or Operator Standards with HDDs.
(Try to avoid RAID with SSDs if possible.)
– Aggregating smaller disks.
(Why not just get larger drives for JBOD?)
Strategy #3: Maintain Headroom
• Build in headroom according to your loading patterns.
• Measure your system with bench tools.
• Saturate during non-prod testing, and use that as a reference
point in production.
Strategy #4: Balance Compute and I/O
• Databases are not just storage APIs.
• You need to keep your CPU and I/O throughput in relative
balance.
• Perfection is not required, but extreme imbalances are no
fun.
• There will always be a bottleneck.
Strategy #5: Balance I/O Caching
• Understand the potential benefits of caching: best and
worst cases.
• “Unused” memory in Linux is available for caching.
• Don’t depend on cache to solve cold read latencies.
• Design around cold-read performance first.
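The best and worst cases of caching can be sketched with a weighted average. This is a simplified model: the 1 µs cached-read latency is an assumed placeholder, and the 119000 µs cold-read latency is the SATA HDD 512-byte figure from the comparative numbers earlier in this deck. The function name is hypothetical.

```python
def expected_read_latency_us(hit_ratio: float, cached_us: float, cold_us: float) -> float:
    """Mean read latency for a given cache hit ratio (0.0 to 1.0)."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cached_us + (1.0 - hit_ratio) * cold_us


CACHED_US = 1.0       # assumption: rough cost of a read served from memory
COLD_US = 119000.0    # SATA HDD, 512-byte random read, from the earlier table

for hit_ratio in (1.0, 0.99, 0.90, 0.0):
    mean_us = expected_read_latency_us(hit_ratio, CACHED_US, COLD_US)
    print(f"hit ratio {hit_ratio:.2f}: mean {mean_us:.1f} µs, cold read still {COLD_US:.0f} µs")
```

The mean improves with the hit ratio, but every miss still pays the full cold-read latency, which is why the cold-read path is the one to design around.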
Storage Selection
Build for Effect
SANs for distributed databases...
It’s a bad idea.
Have strong skepticism when anybody tells you otherwise.
Perhaps they haven’t tried it yet, or are ignoring the obvious.
You don’t have to suffer the pains of others in order to learn
from their experiences. Still, some insist on trying.
HDD vs. SSD
Type: HDD
Pro:
● Cheap?
Con:
● All concurrent operations are contended
● Random access is slow - drive seek
● Power usage
● Lower latencies come with much higher costs
● Little room for further improvement

Type: SSD
Pro:
● Cheap? (1TB ~ $300)
● Fast
● Low internal contention
● Runs cooler / lower wattage
● Faster transport technology available
Con:
● Initial capacities available - encouraged RAID shenanigans → No longer an issue for reasonable data densities with Cassandra/DSE.
● MTBF of earlier designs → No longer an issue as SSDs have made huge strides in reliability and DWPD limits
● Initial cost → No longer an issue
Workload Concurrency & Storage Parallelism
Selecting SSD vs. HDD
Favor modern SSDs by default.
Use HDDs only if you must for:
● High-write applications with low read concurrency
● Archival or Logging systems with low read concurrency
● Commit log storage, if you have the option
● Persistent messaging systems
● Non-latency sensitive batch/analytics workloads
Storage Path
A) Direct SSD
B) Direct HDD
C) NVMe
D) SSDs via HBA
E) HDDs via HBA
F) Combo via HBA
We’ll come back to this
slide if we have time.
Data Density
• Keep data density in reasonable bounds.
• Every database must deal with the realities of storage traversal.
• Avoid trying to store too much data on a node.
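One way to reason about data density is to estimate how long it takes just to read everything on a node once, as a stand-in for any operation that must traverse the data. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a sustained sequential rate; the function name and the 180 MB/s example rate are illustrative, the latter taken from the HDD figure earlier in this deck.

```python
def full_scan_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to read data_tb terabytes once at a sustained throughput."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput_mb_s must be positive")
    total_bytes = data_tb * 1e12
    seconds = total_bytes / (throughput_mb_s * 1e6)
    return seconds / 3600.0


for data_tb in (1, 2, 4, 8):
    hours = full_scan_hours(data_tb, 180.0)
    print(f"{data_tb} TB at 180 MB/s: about {hours:.1f} hours for one full pass")
```

Traversal time grows linearly with the data stored per node, and that is before accounting for the headroom the node needs to keep serving requests during the traversal.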
In Conclusion...
• Provision with headroom to avoid unnecessary contention.
• Select hardware to support user and workload requirements.
• Keep the storage path as simple as possible.
• Consider SSDs by default for your data directories.
Coming Soon!
● June 23: Top 5 Reasons Why DSE is Game Changing
● July 7: Proofpoint & DataStax Webinar
● For the latest schedule of webinars, check out our Webinars
page: http://www.datastax.com/resources/webinars
Get your SMACK on!
Thank You!
Follow me on Twitter: @Shookinator
Q & A
Additional Resources
Latency Spectrum for small ops
Math relating to Scale & Performance
Little’s Law
Relates latency, concurrency and throughput as averages.
Amdahl’s Law
Relates the achievable overall speedup to the share of work that benefits from improved resources.
Pigeonhole principle
Statistics of the pigeonhole principle come up again and again in distributed computing.
Latency numbers every programmer should know.
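Two of the laws above fit in a few lines each. This is a sketch with hypothetical function names. The Little's Law example reuses the NVMe 512-byte figures from the comparative numbers earlier in this deck; the Amdahl's Law inputs are made up for illustration.

```python
def littles_law_in_flight(throughput_per_s: float, latency_s: float) -> float:
    """Little's Law: average requests in flight = throughput x average latency."""
    return throughput_per_s * latency_s


def amdahl_speedup(improved_fraction: float, improvement_factor: float) -> float:
    """Amdahl's Law: overall speedup when only part of the work gets faster."""
    if not 0.0 <= improved_fraction <= 1.0 or improvement_factor <= 0:
        raise ValueError("need 0 <= improved_fraction <= 1 and improvement_factor > 0")
    return 1.0 / ((1.0 - improved_fraction) + improved_fraction / improvement_factor)


# NVMe, 512-byte reads: 124013 IOPS at 177 µs average latency.
in_flight = littles_law_in_flight(124013, 177e-6)
print(f"Implied average concurrency: {in_flight:.1f} requests in flight")

# If storage is half of request time and becomes 10x faster,
# the request as a whole gets less than 2x faster.
print(f"Overall speedup: {amdahl_speedup(0.5, 10.0):.2f}x")
```

Little's Law is a useful cross-check on benchmark results, and Amdahl's Law is a reminder that a faster drive helps only as much as the time spent waiting on the drive.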
Online Resources
C* Microbench scripts
Fio scripts to measure a disk subsystem across many C*-style workloads.
https://github.com/jshook/perfscripts
Al’s Tuning Guide: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
Terms: Concurrency, Parallelism, visually
[Figure panels: concurrency only; concurrency with parallelism]
Addendum: What about RAID?
See IBM Patent 4092732 about a 1978 solution to a 1978
problem: drives were very unreliable, and systems were not
resilient to failure. In 1978, parallelism was pronounced
“mainframe”. Times have changed.
System topologies of today expose storage parallelism all
the way to the drive. Cassandra allows drive failure without
cluster failure. Cassandra can make direct use of the
parallelism exposed at the storage layer.
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 

Building Data Pipelines with SMACK: Designing Storage Strategies for Scale and Performance

  • 1. Building Data Pipelines with SMACK: Storage Strategies for Scale & Performance June 8, 2016 Jonathan Shook, Solution Architect, DataStax
  • 3. 1. Essential Storage Concepts 2. Design Strategies 3. Storage Selection 4. Q & A © DataStax, All Rights Reserved.
  • 5. Important Terms
      • Topology
      • Bandwidth, Throughput, Headroom
      • Latency, Minimum Latency
      • Concurrency, Parallelism, Contention
  • 6. Basic System Topology
      Every modern system is essentially a network of components. The language of message delivery applies at every level of design.
      [Figure: System Topology Example (high level); legend: HDD, SSD]
  • 7. Term: Bandwidth, Throughput, Headroom
      • Bandwidth - Maximum rated transfer speed of a device
      • Throughput - Measurement of achievable transfer speed
      • Headroom - Safety margin above normal usage - “reserve capacity”
  • 8. Throughput Example: SATA3
      Using a popular SSD and an online benchmark...
      Bandwidth:  6Gb/s (750MB/s raw line rate; about 600MB/s usable after SATA's 8b/10b encoding)
      Throughput: 40MB/s-500MB/s as tested, depending on operation type
      Headroom:   30%, for example. This is a design parameter.
      In this case, if you can achieve 200MB/s throughput on the drive for your operational patterns, headroom of 30% means you should be scaling out before your metrics show 140MB/s (200MB/s x 0.70).
  • 9. Term: Latency and Minimum Latency
      • Latency - How long it takes to receive a response, once a request is submitted
      • Minimum Latency - Latency which is possible on a single node when there is no resource contention
      Single Node:
      • However fast that node can service the request, uncontended.
      Replica Set of 3 Nodes and LOCAL_QUORUM:
      • Writes: The fastest 2 of 3 nodes in the replica set to respond.
      • Reads: Usually the fastest 2 of 3, based on latency trends.
  • 10. Latency and Throughput Example: Random reads at different block sizes
      • SATA HDD has an unavoidable seek time penalty for all op sizes. Throughput tops out at 180MB/s at 16MB read sizes, with over 1.5 seconds of latency.
      • SATA SSD performs well. 550MB/s is possible, but desirable latencies are found below 1MB read size.
      • The NVMe drive can push 2 CDs worth of data per second at 128KB read sizes. At 16MB, latency is only 0.25 seconds.
  • 11. Latency and Throughput Example: Compared by Drive Type
      This shows the same measurements compared between drive types.
  • 12. Latency & Throughput Example: Comparative Numbers

      1 block read (512 bytes)
                    KB/s         µs latency   iops
      NVMe          62006        177          124013
      SATA SSD      38700        306          77400
      SATA HDD      215          119000       430

      256 block read (128 KB)
                    KB/s         µs latency   iops
      NVMe          1707520      1160         13339
      SATA SSD      549133       2320         4290
      SATA HDD      41198        157000       321

      32K block read (16 MB)
                    KB/s         µs latency   iops
      NVMe          1339596.8    235000       81
      SATA SSD      554920       594000       33
      SATA HDD      179063       1647000      10
  • 13. Term: Concurrency, Parallelism, Contention
      • Concurrency - Multiple requests in flight
      • Parallelism - Simultaneous processing of requests
      • Resource Contention - When work is blocked awaiting access to a shared resource
      Concurrency without parallelism causes resource contention, queueing, latency increases, and unhappy users.
  • 14. (Storage) Design Strategies: Core Strategies for Going Fast and Staying Fast
  • 15. Key Design Strategies
      1. Design to the Workload
      2. Simplify the Storage Path
      3. Maintain Headroom
      4. Balance Compute and I/O
      5. Balance I/O Caching
  • 16. Strategy #1: Design to the Workload
      • Estimate your workloads. Focus on the read patterns.
      • Can your users endure the effects of resource contention?
      • Can they endure disruptive outliers?
      • How do you know?
  • 17. Strategy #2: Simplify the Storage Path
      • Avoid unnecessary hardware layers. Go directly from your system chipset to the drive when possible.
      • Favor JBOD over storage aggregation.
      • Only use RAID for:
          – Datacenter or Operator Standards with HDDs. (Try to avoid RAID with SSDs if possible.)
          – Aggregating smaller disks. (Why not just get larger drives for JBOD?)
  • 18. Strategy #3: Maintain Headroom
      • Build in headroom according to your loading patterns.
      • Measure your system with bench tools.
      • Saturate during non-prod testing, and use that as a reference point in production.
  • 19. Strategy #4: Balance Compute and I/O
      • Databases are not just storage APIs.
      • You need to keep your CPU and I/O throughput in relative balance.
      • Perfection is not required, but extreme imbalances are no fun.
      • There will always be a bottleneck.
  • 20. Strategy #5: Balance I/O Caching
      • Understand the potential benefits of caching: best and worst cases.
      • “Unused” memory in Linux is available for caching.
      • Don’t depend on cache to solve cold read latencies.
      • Design around cold-read performance first.
  • 22. SANs for distributed databases... It’s a bad idea.
      Have strong skepticism when anybody tells you otherwise. Perhaps they haven’t tried it yet, or are ignoring the obvious. You don’t have to suffer the pains of others in order to learn from their experiences. Still, some insist on trying.
  • 23. HDD vs. SSD
      HDD
        Pro:
          ● Cheap?
        Con:
          ● All concurrent operations are contended
          ● Random access is slow - drive seek
          ● Power usage
          ● Lower latencies come with much higher costs
          ● Little room for further improvement
      SSD
        Pro:
          ● Cheap? (1TB ~ $300)
          ● Fast
          ● Low internal contention
          ● Runs cooler / lower wattage
          ● Faster transport technology available
        Con:
          ● Initial capacities available - encouraged RAID shenanigans → No longer an issue for reasonable data densities with Cassandra/DSE.
          ● MTBF of earlier designs → No longer an issue, as SSDs have made huge strides in reliability and DWPD limits
          ● Initial cost → No longer an issue
  • 24. Workload Concurrency & Storage Parallelism
  • 25. Selecting SSD vs. HDD
      Favor modern SSDs by default. Use HDDs only if you must for:
      ● High-write applications with low read concurrency
      ● Archival or Logging systems with low read concurrency
      ● Commit log storage, if you have the option
      ● Persistent messaging systems
      ● Non-latency-sensitive batch/analytics workloads
  • 26. Storage Path
      A) Direct SSD
      B) Direct HDD
      C) NVMe
      D) SSDs via HBA
      E) HDDs via HBA
      F) Combo via HBA
      We’ll come back to this slide if we have time.
      [Figure legend: HDD, SSD]
  • 27. Data Density
      • Keep data density in reasonable bounds.
      • Every database must deal with the realities of storage traversal.
      • Avoid trying to store too much data on a node.
  • 28. In Conclusion...
      • Provision with headroom to avoid unnecessary contention.
      • Select hardware to support user and workload requirements.
      • Keep the storage path as simple as possible.
      • Consider SSDs by default for your data directories.
  • 29. Coming Soon!
      ● June 23: Top 5 Reasons Why DSE is Game Changing
      ● July 7: Proofpoint & DataStax Webinar
      ● For the latest schedule of webinars, check out our Webinars page: http://www.datastax.com/resources/webinars
      © 2015 DataStax, All Rights Reserved.
  • 30. Get your SMACK on! Thank You! Follow me on Twitter: @Shookinator
  • 31. THANK YOU!
  • 32. Q & A
  • 34. Latency Spectrum for small ops
  • 35. Math relating to Scale & Performance
      • Little’s Law: Relates latency, concurrency and throughput as averages.
      • Amdahl’s Law: Relates latency to improvements in working resources.
      • Pigeonhole principle: Statistics of the pigeonhole principle come up again and again in distributed computing.
      • Latency numbers every programmer should know.
  • 36. Online Resources
      • C* Microbench scripts: Fio scripts to measure a disk subsystem across many C*-style workloads. https://github.com/jshook/perfscripts
      • Al’s Tuning Guide: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
  • 37. Terms: Concurrency, Parallelism, visually
      [Figure panels: concurrency only; concurrency with parallelism]
  • 38. Addendum: What about RAID?
      See IBM Patent 4092732 about a 1978 solution to a 1978 problem: drives were very unreliable, and systems were not resilient to failure. In 1978, parallelism was pronounced “mainframe”.
      Times have changed. System topologies of today expose storage parallelism all the way to the drive. Cassandra allows drive failure without cluster failure. Cassandra can make direct use of the parallelism exposed at the storage layer.
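
Two pieces of arithmetic from the slides can be sketched in a few lines of Python: the headroom rule from slide 8 (200MB/s achievable with 30% headroom gives a 140MB/s scale-out point) and Little's Law from slide 35, applied to the 512-byte read figures on slide 12. This is only an illustrative sketch, not part of the original talk. The function names `scale_out_threshold` and `average_in_flight` are hypothetical, and the Little's Law step assumes the benchmark figures are steady-state averages, which the slides do not state.

```python
"""Worked arithmetic for two relationships used in the talk (illustrative sketch)."""


def scale_out_threshold(achievable_mb_s: float, headroom: float) -> float:
    """Return the throughput (MB/s) at which to start scaling out.

    `headroom` is the fraction of achievable throughput held in reserve,
    e.g. 0.30 for the 30% example on slide 8. (Hypothetical helper.)
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in the range [0, 1)")
    return achievable_mb_s * (1.0 - headroom)


def average_in_flight(iops: float, latency_us: float) -> float:
    """Little's Law, L = lambda * W: average number of requests in flight.

    `iops` is the arrival/completion rate (lambda) and `latency_us` is the
    average time per request (W) in microseconds. (Hypothetical helper.)
    """
    if iops < 0 or latency_us < 0:
        raise ValueError("iops and latency_us must be non-negative")
    return iops * (latency_us / 1_000_000.0)


# 512-byte random read figures from slide 12: (iops, latency in microseconds).
# Assumption: these are steady-state averages, so Little's Law applies.
SMALL_READS = {
    "NVMe": (124013, 177),
    "SATA SSD": (77400, 306),
    "SATA HDD": (430, 119000),
}

if __name__ == "__main__":
    # Slide 8 example: 200 MB/s achievable, 30% headroom.
    print(f"scale out before {scale_out_threshold(200, 0.30):.0f} MB/s")
    for drive, (iops, latency_us) in SMALL_READS.items():
        in_flight = average_in_flight(iops, latency_us)
        print(f"{drive}: about {in_flight:.1f} requests in flight on average")
```

Read this way, Little's Law ties together the three columns of the slide 12 tables: for a fixed number of requests in flight, higher per-request latency directly caps the achievable iops.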