ScyllaDB: Achieving No-compromise Performance

Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/scylladb

Presented at QCon San Francisco
www.qconsf.com

ScyllaDB: Achieving No-Compromise
Performance
Avi Kivity, CTO
@AviKivity
(Hiring!)

Agenda
Background
Goals
Methods
Conclusion

Non-Agenda
● Docker
● Microservices
● Node.js
● Docker
● Orchestration
● JVM GC Tuning
● JSON over HTTP
● Docker

More Non-Agenda
● Cache lines, coherency protocols
● NUMA
● Algorithms are the only thing that matters,
everything else is implementation detail
● Docker

Background - ScyllaDB
● Clustered NoSQL database compatible with
Apache Cassandra
● ~10X performance on same hardware
● Low latency, esp. higher percentiles
● Self tuning
● C++14, fully asynchronous; Seastar!

YCSB Benchmark:
3 node Scylla cluster vs 3, 9, 15, 30 Cassandra machines
[Figure: two charts, each comparing 3 Scylla, 30 Cassandra and 3 Cassandra nodes]

Log-Structured Merge Tree
[Diagram: SSTable 1 through SSTable 5 laid out along a time axis, with SSTable 1+2+3 as the merged table. Labels: Foreground Job, Background Job]
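To make the diagram concrete, here is a minimal sketch of the idea in Python. It is a toy model, not Scylla's implementation: the `flush`, `compact` and `read` names are hypothetical, SSTables are plain dictionaries, and `None` stands in for a tombstone. It assumes the foreground job is the one writing new SSTables and the background job is the one merging them.

```python
from typing import Dict, List, Optional

# Toy model: an SSTable is an immutable, key-sorted mapping.
# None stands in for a tombstone (a deleted key).
SSTable = Dict[str, Optional[str]]


def flush(memtable: Dict[str, Optional[str]]) -> SSTable:
    """Foreground job: write the in-memory table out as a sorted SSTable."""
    return dict(sorted(memtable.items()))


def compact(sstables: List[SSTable]) -> SSTable:
    """Background job: merge SSTables, oldest first, so newer values win."""
    merged: Dict[str, Optional[str]] = {}
    for table in sstables:  # oldest -> newest
        merged.update(table)
    # Simplification: tombstones are dropped, which is only safe here
    # because every table that could hold the key is part of the merge.
    return {k: v for k, v in sorted(merged.items()) if v is not None}


def read(key: str, sstables: List[SSTable]) -> Optional[str]:
    """Check the newest SSTable first; the first hit is the current value."""
    for table in reversed(sstables):
        if key in table:
            return table[key]
    return None


sstable_1 = flush({"a": "1", "b": "1"})
sstable_2 = flush({"b": "2", "c": "2"})
sstable_3 = flush({"a": None})  # delete "a"
sstable_123 = compact([sstable_1, sstable_2, sstable_3])
print(sstable_123)  # {'b': '2', 'c': '2'}
```

Before compaction a read may have to consult several tables; afterwards one table answers the same queries, which is why the background job competes with foreground work for disk and CPU.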

High-level Goals
● Efficiency:
○ Make the most out of every cycle
● Utilization:
○ Squeeze every cycle from the machine
● Control
○ Spend the cycles on what we want, when we want

Characterizing the problem
● Large numbers of small operations
○ Make coordination cheap
● Lots of communications
○ Within the machine
○ With disk
○ With other machines

General Architecture
● Thread-per-core design
○ Never block
● Asynchronous networking
● Asynchronous file I/O
● Asynchronous multicore
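One way to picture a thread-per-core design with asynchronous communication between cores is a set of shards, each owning its data privately and exchanging messages instead of taking locks. The sketch below models that with Python's `asyncio`; the shard count, the `shard_of` hash and the message format are all assumptions made for illustration, and coroutines on one event loop stand in for real cores.

```python
import asyncio
import zlib

N_SHARDS = 4  # stand-in for the number of cores (hypothetical value)


def shard_of(key: str) -> int:
    """Map a key to the single shard that owns it."""
    return zlib.crc32(key.encode()) % N_SHARDS


async def shard_loop(inbox: asyncio.Queue) -> None:
    """One shard: private data, no locks, one message handled at a time."""
    data = {}
    while True:
        op, key, value, reply = await inbox.get()
        if op == "stop":
            return
        if op == "put":
            data[key] = value
            reply.set_result(None)
        else:  # "get"
            reply.set_result(data.get(key))


async def request(inboxes, op, key, value=None):
    """Send a message to the owning shard and await the reply."""
    reply = asyncio.get_running_loop().create_future()
    await inboxes[shard_of(key)].put((op, key, value, reply))
    return await reply


async def main():
    inboxes = [asyncio.Queue() for _ in range(N_SHARDS)]
    shards = [asyncio.create_task(shard_loop(q)) for q in inboxes]
    await request(inboxes, "put", "user:1", "alice")
    result = await request(inboxes, "get", "user:1")
    for q in inboxes:
        await q.put(("stop", None, None, None))
    await asyncio.gather(*shards)
    return result


result = asyncio.run(main())
print(result)  # alice
```

Because each key has exactly one owner, the per-shard dictionary never needs a lock; the cost moves into message passing, which is why cheap coordination matters.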

Scylla has its own task scheduler
Traditional stack
● Thread is a function pointer
● Stack is a byte array from 64k to megabytes
● Context switch cost is high. Large stacks pollute the caches
Scylla’s stack
● Promise is a pointer to eventually computed value
● Task is a pointer to a lambda function
● One scheduler per CPU
● No sharing, millions of parallel events
[Diagram: on the traditional side, many Thread/Stack pairs share the CPUs; on Scylla's side, each CPU has its own Scheduler running a queue of Promise/Task pairs]
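The promise/task vocabulary can be illustrated with a few lines of Python. This is a hand-rolled sketch, not Seastar's API: the `Scheduler` and `Promise` classes and their methods are hypothetical names chosen for the example. A task is just a callable on a run queue, and a promise holds a value that arrives later plus the continuations to schedule once it does.

```python
from collections import deque


class Scheduler:
    """One run queue per CPU; tasks run to completion and never block."""

    def __init__(self):
        self.run_queue = deque()

    def schedule(self, task):
        self.run_queue.append(task)  # a task is just a callable

    def run(self):
        while self.run_queue:
            self.run_queue.popleft()()


class Promise:
    """Holds a value computed later, plus what to run once it is ready."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.ready = False
        self.value = None
        self.continuations = []

    def set_value(self, value):
        self.ready, self.value = True, value
        for fn in self.continuations:
            self.scheduler.schedule(lambda fn=fn: fn(value))
        self.continuations.clear()

    def then(self, fn):
        """Chain fn onto this promise; returns a promise for fn's result."""
        nxt = Promise(self.scheduler)
        step = lambda v: nxt.set_value(fn(v))
        if self.ready:
            self.scheduler.schedule(lambda: step(self.value))
        else:
            self.continuations.append(step)
        return nxt


sched = Scheduler()
disk_read = Promise(sched)
log = []
disk_read.then(lambda row: row.upper()).then(log.append)
sched.schedule(lambda: disk_read.set_value("row-42"))  # simulated I/O completion
sched.run()
print(log)  # ['ROW-42']
```

Each pending operation costs one small object rather than a thread stack, which is what makes very large numbers of in-flight events affordable in this style.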

Fundamental performance equation
Concurrency = Throughput * Latency

Throughput = Concurrency / Latency

Latency = Concurrency / Throughput
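A quick worked example of the three forms of the equation, using made-up numbers. The function names and the figures (1 ms latency, 100,000 operations per second) are assumptions chosen only to make the arithmetic easy to follow.

```python
def concurrency(throughput_ops_s: float, latency_s: float) -> float:
    """Concurrency = Throughput * Latency"""
    return throughput_ops_s * latency_s


def throughput(concurrency_ops: float, latency_s: float) -> float:
    """Throughput = Concurrency / Latency"""
    return concurrency_ops / latency_s


def latency(concurrency_ops: float, throughput_ops_s: float) -> float:
    """Latency = Concurrency / Throughput"""
    return concurrency_ops / throughput_ops_s


# Hypothetical system: each operation takes 1 ms and we want 100,000 ops/s.
needed = concurrency(100_000, 0.001)
print(f"operations in flight needed: {needed:.0f}")

# If the system is already saturated at 100,000 ops/s, doubling the
# number of in-flight operations only doubles the latency.
print(f"latency at 200 in flight: {latency(200, 100_000) * 1000:.1f} ms")
```

The second print is the tension in a nutshell: below saturation, more concurrency buys throughput; at saturation, it only buys latency.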

Lower bounds for concurrency
● Disks want minimum iodepth for full
throughput (heads/chips)
● Remote nodes need concurrency to hide
network latency and their own min.
concurrency
● Compute wants work for each core

Results of Mathematical Analysis
● Want high concurrency (for throughput)
● Want low concurrency (for latency)
● Resources require concurrency for full
utilization

Sources of concurrency
● Users
○ Reduce concurrency / add nodes
● Internal processes
○ Generate as much concurrency as possible
○ Schedule

Resource Scheduling
[Diagram: a Scheduler sits between four request classes and Storage. Classes: User read, User write, Compaction (internal), Streaming (internal). Numeric labels in the figure: 8, 30, 12, 50, 50]
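A share-based scheduler of this kind can be sketched as a fair queue: each class accumulates "virtual time" inversely proportional to its shares, and the backlogged class that is furthest behind goes next. The code below is an illustrative model only. The `FairQueue` class, the two class names and the share values are assumptions, not the values or the algorithm from the slide.

```python
from collections import Counter, deque
from fractions import Fraction


class FairQueue:
    """Dispatch requests in proportion to per-class shares."""

    def __init__(self, shares):
        self.shares = dict(shares)
        self.pending = {name: deque() for name in shares}
        # Exact fractions avoid floating-point ties changing the order.
        self.vtime = {name: Fraction(0) for name in shares}

    def submit(self, cls, request):
        self.pending[cls].append(request)

    def dispatch(self):
        """Pop from the backlogged class that is furthest behind its share."""
        ready = [c for c in self.pending if self.pending[c]]
        if not ready:
            return None
        cls = min(ready, key=lambda c: self.vtime[c])
        request = self.pending[cls].popleft()
        self.vtime[cls] += Fraction(1, self.shares[cls])  # unit cost per request
        return cls, request


# Hypothetical shares: user reads get five times the weight of compaction.
queue = FairQueue({"user_read": 50, "compaction": 10})
for i in range(60):
    queue.submit("user_read", i)
    queue.submit("compaction", i)

counts = Counter(queue.dispatch()[0] for _ in range(60))
print(dict(counts))  # {'user_read': 50, 'compaction': 10}
```

Because the scheduler sees the class of every request, priority no longer has to be inferred from which thread issued the I/O.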

Why not the Linux I/O scheduler?
● Can only communicate priority by originating
thread
● Will reorder/merge like crazy
● Disable

Figuring out optimal disk concurrency
[Figure: chart with an annotation marking the max useful disk concurrency]
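One plausible way to find that point is to sweep the number of in-flight requests and pick the smallest value that gets close to peak throughput. The sketch below assumes exactly that procedure; `max_useful_concurrency` is a hypothetical helper, and `fake_disk` is a synthetic model standing in for real measurements, not data from the talk.

```python
def max_useful_concurrency(samples, tolerance=0.05):
    """Smallest concurrency whose throughput is within `tolerance` of the best."""
    best = max(tput for _, tput in samples)
    for conc, tput in sorted(samples):
        if tput >= best * (1 - tolerance):
            return conc


def fake_disk(concurrency, saturation=32, iops_per_slot=1_000):
    """Synthetic device: throughput scales linearly, then flattens."""
    return min(concurrency, saturation) * iops_per_slot


samples = [(c, fake_disk(c)) for c in (1, 2, 4, 8, 16, 32, 64, 128)]
knee = max_useful_concurrency(samples)
print(knee)  # 32

# Past the knee, Latency = Concurrency / Throughput keeps growing
# while throughput stays flat.
for c in (32, 64, 128):
    print(c, f"{c / fake_disk(c) * 1000:.1f} ms")
```

Requests beyond the knee are better held back in the application's own queues, where they can still be reordered by priority.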

Cache design
Cache files or objects?

Using the kernel page cache
● 4k granularity
● Thread-safe
● Synchronous APIs
● General-purpose
● Lack of control (1)
● Lack of control (2)
● Exists
● Hundreds of hacker-years
● Handling lots of edge cases

Unified cache
Cassandra: Key cache, Row cache (On-heap / Off-heap), Linux page cache, SSTables
Scylla: Unified cache, SSTables

Workload Conditioning
• Internal feedback loops to balance competing loads
[Diagram: the Seastar Scheduler arbitrates Memtable, Compaction, Query, Repair and Commitlog work across SSD, WAN and CPU. A Compaction Backlog Monitor and a Memory Monitor each feed an "Adjust priority" signal back to the scheduler]
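A feedback loop of this shape can be as simple as a proportional controller: the larger the backlog, the more scheduler shares the background job receives. The function below is a sketch under that assumption. Its name, the linear mapping and the share limits are all hypothetical and not taken from Scylla.

```python
def compaction_shares(backlog_bytes, target_bytes, min_shares=10, max_shares=100):
    """Scale scheduler shares linearly with backlog, clamped to a range."""
    ratio = backlog_bytes / target_bytes
    shares = min_shares + ratio * (max_shares - min_shares)
    return int(max(min_shares, min(max_shares, shares)))


# No backlog: compaction runs at its floor and queries get the resources.
print(compaction_shares(0, 100))    # 10
# Half the target backlog: compaction gets a proportional boost.
print(compaction_shares(50, 100))   # 55
# Backlog beyond the target: shares are capped at the maximum.
print(compaction_shares(200, 100))  # 100
```

The output of such a controller would feed the shares of a scheduler class, so priority adjusts continuously instead of being tuned by hand.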

Replacing the system
memory allocator

System memory allocator problems
● Thread safe (synchronization overhead)
● No allocation back pressure

Seastar memory allocator
● Non-Thread safe!
○ Each core gets a private memory pool
● Allocation back pressure
○ Allocator calls a callback when low on memory
○ Scylla evicts cache in response
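The back-pressure idea can be modeled in a few lines: a per-core pool that, when it runs low, calls a reclaim callback, and a cache that responds by evicting its oldest entries. This is a toy model with hypothetical names (`ShardAllocator`, `Cache`, `low_watermark`); sizes are abstract byte counts rather than real memory.

```python
from collections import OrderedDict


class ShardAllocator:
    """Per-core pool: no locks, because only the owning core touches it."""

    def __init__(self, capacity, low_watermark, reclaim):
        self.capacity = capacity
        self.used = 0
        self.low_watermark = low_watermark
        self.reclaim = reclaim  # callback(bytes_needed) -> bytes freed

    def allocate(self, size):
        free_after = self.capacity - self.used - size
        if free_after < self.low_watermark:
            # Back pressure: ask the application to give memory back.
            self.used -= self.reclaim(self.low_watermark - free_after)
        if self.capacity - self.used < size:
            raise MemoryError("reclaim did not free enough memory")
        self.used += size


class Cache:
    """Tracks entry sizes in insertion order; evicts oldest first."""

    def __init__(self):
        self.entries = OrderedDict()  # key -> size

    def evict(self, need):
        freed = 0
        while self.entries and freed < need:
            _, size = self.entries.popitem(last=False)
            freed += size
        return freed


cache = Cache()
alloc = ShardAllocator(capacity=100, low_watermark=20, reclaim=cache.evict)
for i in range(10):
    alloc.allocate(10)
    cache.entries[f"row{i}"] = 10

print(alloc.used, list(cache.entries))
```

In this run the pool settles at 80 bytes used: the two oldest rows are evicted as the ninth and tenth allocations would otherwise dip below the watermark.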

Remaining problems with
malloc/free
● Memory gets fragmented over time
○ If workload changes sizes of allocated objects
● Allocating a large contiguous block
requires evicting most of cache

Log-structured memory allocation
● The cache
○ Large majority of memory allocated
○ Small subset of allocation sites
● Teach allocator how to move allocated
objects around
○ Updating references

Log-structured memory allocation
[Animation in the original slides]

Userspace TCP/IP stack
● Thread-per-core design
● Use DPDK to drive hardware
● Present as experimental mode
○ Needs more testing and productization

Query Compilation to Native Code
● Use LLVM to JIT-compile CQL queries
● Embed database schema and internal
object layouts into the query

Conclusions
● Full control of the software stack can generate big payoffs
● Careful system design can maximize throughput
● Without sacrificing latency
● Without requiring endless end-user tuning
● While having a lot of fun

How to interact
● Download: http://www.scylladb.com
● Twitter: @ScyllaDB
● Source: http://github.com/scylladb/scylla
● Mailing lists: scylladb-user @ groups.google.com
● Company site & blog: http://www.scylladb.com

THE SCYLLA IS THE LIMIT
Thank you.
