SlideShare uma empresa Scribd logo
1 de 47
InfoQ.com: News & Community Site
• Over 1,000,000 software developers, architects and CTOs read the site world-
wide every month
• 250,000 senior developers subscribe to our weekly newsletter
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• 2 dedicated podcast channels: The InfoQ Podcast, with a focus on
Architecture and The Engineering Culture Podcast, with a focus on building
• 96 deep dives on innovative topics packed as downloadable emags and
minibooks
• Over 40 new content items per week
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
scylladb
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
Presented at QCon San Francisco
www.qconsf.com
ScyllaDB: Achieving No-Compromise
Performance
Avi Kivity, CTO
@AviKivity
(Hiring!)
Agenda
Background
Goals
Methods
Conclusion
Non-Agenda
● Docker
● Microservices
● Node.js
● Docker
● Orchestration
● JVM GC Tuning
● JSON over HTTP
● Docker
More Non-Agenda
● Cache lines, coherency protocols
● NUMA
● Algorithms are the only thing that matters,
everything else is implementation detail
● Docker
Background - ScyllaDB
● Clustered NoSQL database compatible with
Apache Cassandra
● ~10X performance on same hardware
● Low latency, esp. higher percentiles
● Self tuning
● C++14, fully asynchronous; Seastar!
YCSB Benchmark:
3 node Scylla cluster vs 3, 9, 15, 30
Cassandra machines
3 Scylla
30 Cassandra
3 Cassandra
3 Scylla
30 Cassandra
3 Cassandra
Log-Structured Merge Tree
SStable 1
SStable 2
SStable 3
Time
SStable 4
SStable 5
SStable 1+2+3
Foreground Job Background Job
High-level Goals
● Efficiency:
○ Make the most out of every cycle
● Utilization:
○ Squeeze every cycle from the machine
● Control
○ Spend the cycles on what we want, when we want
Characterizing the problem
● Large numbers of small operations
○ Make coordination cheap
● Lots of communications
○ Within the machine
○ With disk
○ With other machines
Asynchrony,
Everywhere
● Thread-per-core design
○ Never block
● Asynchronous networking
● Asynchronous file I/O
● Asynchronous multicore
General Architecture
Scylla has its own task scheduler
Traditional stack Scylla’s stack
Promise
Task
Promise
Task
Promise
Task
Promise
Task
CPU
Promise
Task
Promise
Task
Promise
Task
Promise
Task
CPU
Promise
Task
Promise
Task
Promise
Task
Promise
Task
CPU
Promise
Task
Promise
Task
Promise
Task
Promise
Task
CPU
Promise
Task
Promise
Task
Promise
Task
Promise
Task
CPU
Promise is a
pointer to
eventually
computed value
Task is a
pointer to a
lambda function
Scheduler
CPU
Scheduler
CPU
Scheduler
CPU
Scheduler
CPU
Scheduler
CPU
Thread
Stack
Thread
Stack
Thread
Stack
Thread
Stack
Thread
Stack
Thread
Stack
Thread
Stack
Thread
Stack
Thread is a
function pointer
Stack is a byte
array from 64k
to megabytes
Context switch cost is
high. Large stacks pollutes
the caches No sharing, millions of
parallel events
The Concurrency
Dilemma
Fundamental performance equation
Concurrency = Throughput * Latency
Fundamental performance equation
Throughput =
Concurrency
Latency
Fundamental performance equation
Latency =
Concurrency
Throughput
Lower bounds for concurrency
● Disks want minimum iodepth for full
throughput (heads/chips)
● Remote nodes need concurrency to hide
network latency and their own min.
concurrency
● Compute wants work for each core
Results of Mathematical Analysis
● Want high concurrency (for throughput)
● Want low concurrency (for latency)
● Resources require concurrency for full
utilization
Sources of concurrency
● Users
○ Reduce concurrency / add nodes
● Internal processes
○ Generate as much concurrency as possible
○ Schedule
Resource Scheduling
Scheduler
Storage
8
User read
User write
Compaction (internal)
Streaming (internal)
30
12
50
50
Why not the Linux I/O scheduler?
● Can only communicate priority by originating
thread
● Will reorder/merge like crazy
● Disable
Figuring out optimal disk
concurrency
Max useful disk
concurrency
Cache design
Cache files or objects?
Using the kernel page cache
● 4k granularity
● Thread-safe
● Synchronous APIs
● General-purpose
● Lack of control (1)
● Lack of control (2)
● Exists
● Hundreds of
hacker-years
● Handling lots of edge
cases
Unified cache
Cassandra Scylla
Key cache
Row cache
On-heap /
Off-heap
Linux page cache
SSTables
Unified cache
SSTables
Workload Conditioning
Workload Conditioning
• Internal feedback loops to balance competing loads
Memtable
Seastar
Scheduler
Compaction
Query
Repair
Commitlog
SSD
Compaction
Backlog
Monitor
Memory
Monitor
Adjust priority
Adjust priority
WAN
CPU
Replacing the system
memory allocator
System memory allocator problems
● Thread safe
● Allocation back pressure
Seastar memory allocator
● Non-Thread safe!
○ Each core gets a private memory pool
● Allocation back pressure
○ Allocator calls a callback when low on memory
○ Scylla evicts cache in response
One allocator
is not enough
Remaining problems with
malloc/free
● Memory gets fragmented over time
○ If workload changes sizes of allocated objects
● Allocating a large contiguous block
requires evicting most of cache
OOM :(Memory
Log-structured memory allocation
● The cache
○ Large majority of memory allocated
○ Small subset of allocation sites
● Teach allocator how to move allocated
objects around
○ Updating references
Log-structured memory allocation
Fancy Animation
Future Improvements
Userspace TCP/IP stack
● Thread-per-core design
● Use DPDK to drive hardware
● Present as experimental mode
○ Needs more testing and productization
Query Compilation to Native Code
● Use LLVM to JIT-compile CQL queries
● Embed database schema and internal
object layouts into the query
● Full control of the software stack can generate big
payoffs
● Careful system design can maximize throughput
● Without sacrificing latency
● Without requiring endless end-user tuning
● While having a lot of fun
Conclusions
● Download: http://www.scylladb.com
● Twitter: @ScyllaDB
● Source: http://github.com/scylladb/scylla
● Mailing lists: scylladb-user @ groups.google.com
● Company site & blog: http://www.scylladb.com
How to interact
THE SCYLLA IS THE LIMIT
Thank you.
Watch the video with slide synchronization on
InfoQ.com!
https://www.infoq.com/presentations/scylladb

Mais conteúdo relacionado

Destaque

Advanced portfolio progress sheet 11 june
Advanced portfolio progress sheet 11 juneAdvanced portfolio progress sheet 11 june
Advanced portfolio progress sheet 11 june
jphibbert
 

Destaque (14)

El franquismo
El franquismoEl franquismo
El franquismo
 
A origem do bem e do mal
A origem do bem e do malA origem do bem e do mal
A origem do bem e do mal
 
Seastar @ SF/BA C++UG
Seastar @ SF/BA C++UGSeastar @ SF/BA C++UG
Seastar @ SF/BA C++UG
 
ScyllaDB @ Apache BigData, may 2016
ScyllaDB @ Apache BigData, may 2016ScyllaDB @ Apache BigData, may 2016
ScyllaDB @ Apache BigData, may 2016
 
Scylla: 1 Million CQL operations per second per server
Scylla: 1 Million CQL operations per second per serverScylla: 1 Million CQL operations per second per server
Scylla: 1 Million CQL operations per second per server
 
Scylla Summit 2016: Scylla at Samsung SDS
Scylla Summit 2016: Scylla at Samsung SDSScylla Summit 2016: Scylla at Samsung SDS
Scylla Summit 2016: Scylla at Samsung SDS
 
Back to the future with C++ and Seastar
Back to the future with C++ and SeastarBack to the future with C++ and Seastar
Back to the future with C++ and Seastar
 
OSv – The OS designed for the Cloud
OSv – The OS designed for the CloudOSv – The OS designed for the Cloud
OSv – The OS designed for the Cloud
 
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB,  or how we implemented a 10-times faster CassandraSeastar / ScyllaDB,  or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
 
OSv: probably the best OS for cloud workloads you've never hear of
OSv: probably the best OS for cloud workloads you've never hear ofOSv: probably the best OS for cloud workloads you've never hear of
OSv: probably the best OS for cloud workloads you've never hear of
 
OSv presentation from Linux Foundation Collaboration Summit
OSv presentation from Linux Foundation Collaboration SummitOSv presentation from Linux Foundation Collaboration Summit
OSv presentation from Linux Foundation Collaboration Summit
 
Katy
KatyKaty
Katy
 
Advanced portfolio progress sheet 11 june
Advanced portfolio progress sheet 11 juneAdvanced portfolio progress sheet 11 june
Advanced portfolio progress sheet 11 june
 
Competitor beer brand in thunderbolt
Competitor beer brand in thunderboltCompetitor beer brand in thunderbolt
Competitor beer brand in thunderbolt
 

Mais de C4Media

Mais de C4Media (20)

Streaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live VideoStreaming a Million Likes/Second: Real-Time Interactions on Live Video
Streaming a Million Likes/Second: Real-Time Interactions on Live Video
 
Next Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy MobileNext Generation Client APIs in Envoy Mobile
Next Generation Client APIs in Envoy Mobile
 
Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020Software Teams and Teamwork Trends Report Q1 2020
Software Teams and Teamwork Trends Report Q1 2020
 
Understand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java ApplicationsUnderstand the Trade-offs Using Compilers for Java Applications
Understand the Trade-offs Using Compilers for Java Applications
 
Kafka Needs No Keeper
Kafka Needs No KeeperKafka Needs No Keeper
Kafka Needs No Keeper
 
High Performing Teams Act Like Owners
High Performing Teams Act Like OwnersHigh Performing Teams Act Like Owners
High Performing Teams Act Like Owners
 
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to JavaDoes Java Need Inline Types? What Project Valhalla Can Bring to Java
Does Java Need Inline Types? What Project Valhalla Can Bring to Java
 
Service Meshes- The Ultimate Guide
Service Meshes- The Ultimate GuideService Meshes- The Ultimate Guide
Service Meshes- The Ultimate Guide
 
Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 

Último

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Último (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 

ScyllaDB: Achieving No-compromise Performance

  • 1.
  • 2. InfoQ.com: News & Community Site • Over 1,000,000 software developers, architects and CTOs read the site world- wide every month • 250,000 senior developers subscribe to our weekly newsletter • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • 2 dedicated podcast channels: The InfoQ Podcast, with a focus on Architecture and The Engineering Culture Podcast, with a focus on building • 96 deep dives on innovative topics packed as downloadable emags and minibooks • Over 40 new content items per week Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ scylladb
  • 3. Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide Presented at QCon San Francisco www.qconsf.com
  • 4. ScyllaDB: Achieving No-Compromise Performance Avi Kivity, CTO @AviKivity (Hiring!)
  • 6. Non-Agenda ● Docker ● Microservices ● Node.js ● Docker ● Orchestration ● JVM GC Tuning ● JSON over HTTP ● Docker
  • 7. More Non-Agenda ● Cache lines, coherency protocols ● NUMA ● Algorithms are the only thing that matters, everything else is implementation detail ● Docker
  • 8. Background - ScyllaDB ● Clustered NoSQL database compatible with Apache Cassandra ● ~10X performance on same hardware ● Low latency, esp. higher percentiles ● Self tuning ● C++14, fully asynchronous; Seastar!
  • 9. YCSB Benchmark: 3 node Scylla cluster vs 3, 9, 15, 30 Cassandra machines 3 Scylla 30 Cassandra 3 Cassandra 3 Scylla 30 Cassandra 3 Cassandra
  • 10. Log-Structured Merge Tree SStable 1 SStable 2 SStable 3 Time SStable 4 SStable 5 SStable 1+2+3 Foreground Job Background Job
  • 11. High-level Goals ● Efficiency: ○ Make the most out of every cycle ● Utilization: ○ Squeeze every cycle from the machine ● Control ○ Spend the cycles on what we want, when we want
  • 12. Characterizing the problem ● Large numbers of small operations ○ Make coordination cheap ● Lots of communications ○ Within the machine ○ With disk ○ With other machines
  • 14.
  • 15.
  • 16. ● Thread-per-core design ○ Never block ● Asynchronous networking ● Asynchronous file I/O ● Asynchronous multicore General Architecture
  • 17. Scylla has its own task scheduler Traditional stack Scylla’s stack Promise Task Promise Task Promise Task Promise Task CPU Promise Task Promise Task Promise Task Promise Task CPU Promise Task Promise Task Promise Task Promise Task CPU Promise Task Promise Task Promise Task Promise Task CPU Promise Task Promise Task Promise Task Promise Task CPU Promise is a pointer to eventually computed value Task is a pointer to a lambda function Scheduler CPU Scheduler CPU Scheduler CPU Scheduler CPU Scheduler CPU Thread Stack Thread Stack Thread Stack Thread Stack Thread Stack Thread Stack Thread Stack Thread Stack Thread is a function pointer Stack is a byte array from 64k to megabytes Context switch cost is high. Large stacks pollutes the caches No sharing, millions of parallel events
  • 21. Fundamental performance equation Latency = Concurrency Throughput
  • 22. Lower bounds for concurrency ● Disks want minimum iodepth for full throughput (heads/chips) ● Remote nodes need concurrency to hide network latency and their own min. concurrency ● Compute wants work for each core
  • 23. Results of Mathematical Analysis ● Want high concurrency (for throughput) ● Want low concurrency (for latency) ● Resources require concurrency for full utilization
  • 24. Sources of concurrency ● Users ○ Reduce concurrency / add nodes ● Internal processes ○ Generate as much concurrency as possible ○ Schedule
  • 25. Resource Scheduling Scheduler Storage 8 User read User write Compaction (internal) Streaming (internal) 30 12 50 50
  • 26. Why not the Linux I/O scheduler? ● Can only communicate priority by originating thread ● Will reorder/merge like crazy ● Disable
  • 27. Figuring out optimal disk concurrency Max useful disk concurrency
  • 29. Using the kernel page cache ● 4k granularity ● Thread-safe ● Synchronous APIs ● General-purpose ● Lack of control (1) ● Lack of control (2) ● Exists ● Hundreds of hacker-years ● Handling lots of edge cases
  • 30. Unified cache Cassandra Scylla Key cache Row cache On-heap / Off-heap Linux page cache SSTables Unified cache SSTables
  • 32. Workload Conditioning • Internal feedback loops to balance competing loads Memtable Seastar Scheduler Compaction Query Repair Commitlog SSD Compaction Backlog Monitor Memory Monitor Adjust priority Adjust priority WAN CPU
  • 34. System memory allocator problems ● Thread safe ● Allocation back pressure
  • 35. Seastar memory allocator ● Non-Thread safe! ○ Each core gets a private memory pool ● Allocation back pressure ○ Allocator calls a callback when low on memory ○ Scylla evicts cache in response
  • 37. Remaining problems with malloc/free ● Memory gets fragmented over time ○ If workload changes sizes of allocated objects ● Allocating a large contiguous block requires evicting most of cache
  • 39. Log-structured memory allocation ● The cache ○ Large majority of memory allocated ○ Small subset of allocation sites ● Teach allocator how to move allocated objects around ○ Updating references
  • 42. Userspace TCP/IP stack ● Thread-per-core design ● Use DPDK to drive hardware ● Present as experimental mode ○ Needs more testing and productization
  • 43. Query Compilation to Native Code ● Use LLVM to JIT-compile CQL queries ● Embed database schema and internal object layouts into the query
  • 44. ● Full control of the software stack can generate big payoffs ● Careful system design can maximize throughput ● Without sacrificing latency ● Without requiring endless end-user tuning ● While having a lot of fun Conclusions
  • 45. ● Download: http://www.scylladb.com ● Twitter: @ScyllaDB ● Source: http://github.com/scylladb/scylla ● Mailing lists: scylladb-user @ groups.google.com ● Company site & blog: http://www.scylladb.com How to interact
  • 46. THE SCYLLA IS THE LIMIT Thank you.
  • 47. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/scylladb