Fetching a large amount of data in a single query is a longstanding pain point for applications. Queries that return a significant amount of data have to be paged, in other words, split into multiple subqueries that return the data little by little. In both Scylla and Apache Cassandra, paging has been stateless: each subquery is independent of the others and can even be sent to different replicas. Because of this, the work done in previous subqueries is not reused, reducing throughput below the expected maximum. In this talk we examine the problems with the previous stateless paging implementation and introduce the new stateful paging implementation that brings vast improvements in the throughput of large partition scans.
Scylla Summit 2018: How We Made Large Partition Scans Over Two Times Faster
1. How we made large partition
scans over two times faster
Botond Denes
Software Developer @ ScyllaDB
2. Presenter bio
Botond is a software engineer who has worked in a range of
roles, from web developer to backend developer, in a range of
industries from railway automation to finance. He loves
programming and solving challenging problems with elegant
code, open-source software, Linux, and C++. What he likes best
about working here is that Scylla is made up of that entire list.
7. What is Stateless Paging?
Paging State Cookie
Create Query State
Destroy Query State
LEGEND
CLIENT SCYLLA
8. What is Wrong with Stateless Paging?
▪ Setting up the query state requires a non-trivial amount of work
▪ Relatively cheap for row-cache
▪ Expensive for read-from-disk:
• Identify sstables
• Read summary and index files
• Skip to start position in the sstable
10. Stateful Paging
CLIENT SCYLLA
Save Query State
Look-up Query State
Query Key
Paging State Cookie
Create Query State
Destroy Query State
LEGEND
11. What is the query state exactly?
Cluster
Node 0
0 1 2 3 4 5
Node 1
0 1 2 3 4 5
Node 2
0 1 2 3 4 5
12. Sticky Replicas
▪ Send all page requests to the same set of replicas
▪ Implemented by storing the list of replicas in the
Paging State Cookie
▪ The replicas are chosen on the first page and “stuck to”
for the rest of the query
13. Querier Cache - Overview
▪ Special-purpose cache
▪ Each shard of each node has one
▪ Entries are saved under the query key
▪ Multiple entries can be inserted with the same key
14. Querier Cache - Dealing with Failures
▪ Create new querier on miss
▪ Drop found querier and create a new one on:
• Read position mismatch
• Schema version mismatch
15. Querier Cache - Eviction Policies
▪ Time based
▪ Memory based
▪ Read permit based
16. Diagnostics
New counters:
▪ Lookups
▪ Misses
▪ Drops
▪ Evictions
▪ Population
New CQL trace messages:
▪ When a querier is looked up
▪ When a querier is cached
Scylla's query paging.
How we improved it.
How this benefits large partition scans and partition range scans.
Note that I might use “read” and “query” interchangeably.
Queries can return an unknown amount of data.
The exact amount is only known *after* the query has been executed.
Reading an unknown amount of data at once is dangerous: it can fill up memory and hog the CPU, and can cause denial of service.
To avoid this Scylla uses paging, that is, it reads and transmits the results in limited-size chunks, called pages.
Pages are limited by the number of rows (10K, changeable by the client) and by size (1 MB, a fixed sanity limit, since setting the row limit to a very high number could otherwise lead to denial of service).
After sending each page to the client, Scylla stops and waits for the client to explicitly request the next page.
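The request/response cycle above can be sketched from the client's point of view. This is a minimal sketch only; `fetch_page` and the in-memory "server" below are hypothetical stand-ins, not a real driver API:

```python
# Minimal sketch of client-driven paging: the client keeps requesting
# pages until the server returns no paging-state cookie.
# The "server" here pages over an in-memory list of rows.

PAGE_SIZE = 3  # stand-in for the 10K-row / 1 MB page limits

def fetch_page(rows, cookie):
    """Return one page of `rows` starting at the position in `cookie`."""
    start = cookie or 0
    page = rows[start:start + PAGE_SIZE]
    next_cookie = start + PAGE_SIZE if start + PAGE_SIZE < len(rows) else None
    return page, next_cookie

def scan(rows):
    """Client loop: explicitly request the next page until none remain."""
    results, cookie = [], None
    while True:
        page, cookie = fetch_page(rows, cookie)
        results.extend(page)
        if cookie is None:  # no more pages
            return results
```

The key point the sketch illustrates is that the server does nothing between pages: the client drives the loop by explicitly requesting each next page.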
A cookie is transmitted with each request and response.
This cookie is called the “Paging State”.
The Paging State is an opaque binary blob with arbitrary content from the point of view of the client.
Scylla can choose what to include in it, its content is not part of the protocol.
This provides a certain flexibility for the server in the implementation of paging.
In any case, Scylla stored just the read position in it: the last partition key and clustering key.
Scylla itself didn't store any state related to the query.
Scylla's internal Query State was created anew at the beginning of each page and destroyed at the end.
The Query State is an abstract concept that represents all the internal state required to serve the query.
Essentially each page is a separate query.
This has the advantage of simple code but has many drawbacks.
Scylla had to do all of this over again at the start of each page.
Scylla doesn't use the OS page cache, so all this effort is truly lost when the state is destroyed; we don't get the benefit of the OS having cached the recently read files in RAM for us.
This gets worse as the size of the scanned partition increases, so it hurts large partitions especially badly.
Make paging stateful, that is, create the Query State on the first page and use it throughout the entire query.
Sounds simple but had a lot of details to get right.
At the end of each page we save the Query State in a cache.
For this we use a unique key called the Query Key, which is generated on the first page.
This key is then remembered by being included in the Paging State Cookie.
At the beginning of all subsequent pages we look up the saved Query State and continue the query where we left off.
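The save/lookup cycle can be sketched as follows. This is an illustrative model, not Scylla's actual internals; `serve_page`, `querier_cache`, and the state dict are all hypothetical names:

```python
import uuid

# Sketch of the stateful flow: the Query State is created on the first
# page, saved in a cache at the end of each page under a unique Query
# Key, and the key travels back to the client inside the Paging State
# Cookie so subsequent pages can find the saved state again.

querier_cache = {}

def serve_page(cookie, make_state):
    if cookie is None:
        # First page: no key in the cookie yet -> create state and key.
        key, state = uuid.uuid4(), make_state()
    else:
        key = cookie
        state = querier_cache.pop(key, None)
        if state is None:            # miss: recreate from scratch
            state = make_state()
    state["pages_served"] += 1       # ... serve the page ...
    querier_cache[key] = state       # save state at end of page
    return key                       # key goes back inside the cookie
```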
The Query has a local state on each shard of each node that is involved in the query.
In this imaginary cluster of 3 nodes, with 6 shards each, an imaginary query is run on Shard1 of Node0 and Node2.
This imaginary query can be a single partition scan, executed with a CL=2.
So the query state will be made up from the local state on Shard1 of Node0 and the local state on Shard1 of Node2.
We call the local state the Querier, the querier is an actual object that encapsulates all state and logic required to serve the query on a single shard of a single node.
So when we are talking about saving the query state, we mean saving the actual querier objects.
All this state is located on replicas, no state is stored on the coordinator.
The coordinator is the node that receives the request. Its job is to select the replicas to forward the request to, merge the results, and send them to the client.
The replica is the node that actually has the data, and that actually executes the query.
The coordinator can be a replica as well, in fact drivers will choose replicas such that this is true.
Since state is local to replicas, we have to use the same set of replicas through the query.
This has a side effect: the driver can choose a coordinator that is not one of the replicas previously used for this query, which introduces an extra network hop.
Drivers choose a new coordinator for each page for load balancing.
This can be fixed by changing the driver to stick to the same coordinator for the entire query.
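The sticky-replicas mechanism from the slides (store the replica list in the Paging State Cookie on the first page, reuse it afterwards) can be sketched like this. The function name and the selection policy are stand-ins:

```python
# Sketch of sticky replicas: on the first page the coordinator picks
# replicas and records them in the paging-state cookie; all later
# pages reuse that list instead of choosing replicas again.

def pick_replicas(cookie, live_replicas, rf):
    if cookie.get("replicas"):
        # Stick to the replicas chosen on the first page.
        return cookie["replicas"]
    chosen = live_replicas[:rf]      # stand-in for the real selection policy
    cookie["replicas"] = chosen      # remember them in the cookie
    return chosen
```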
Piotr Jastrzebski talks about this in detail in his talk about driver optimizations.
It is the foundation upon which stateful paging is built.
When multiple entries have the same key, we distinguish them by their read range - the partition range they are reading.
In the case of single partition scans this will be just a single partition.
This is possible for IN queries, if two listed partitions are located on the same shard of the same node.
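A multimap-style cache like the one described (several entries under one Query Key, disambiguated by read range) could look like this sketch. The class and its methods are illustrative, not Scylla's actual data structure:

```python
from collections import defaultdict

# Sketch of the querier cache as a multimap: several entries may share
# one Query Key (e.g. an IN query hitting two partitions on the same
# shard of the same node); a lookup picks the entry whose read range
# matches the incoming page request.

class QuerierCache:
    def __init__(self):
        self._entries = defaultdict(list)   # key -> [(read_range, querier)]

    def insert(self, key, read_range, querier):
        self._entries[key].append((read_range, querier))

    def lookup(self, key, read_range):
        for i, (rng, querier) in enumerate(self._entries[key]):
            if rng == read_range:
                del self._entries[key][i]   # an entry is consumed on a hit
                return querier
        return None                         # miss
```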
In a perfect world each lookup for a saved querier succeeds and querier can be used to continue the query.
We don’t live in a perfect world - a lot can go wrong in a distributed database.
A previously used replica can crash or become partitioned away; the query has to move to a new one, and the lookup will miss.
It is possible for the lookup to succeed but the querier to be not suitable for continuing the query.
It is possible that the page request will want to continue from a position that doesn’t match the cached querier’s.
The position of a querier is the position it stopped reading on the previous page and consequently the position it will continue on the next page.
This position has row granularity.
This can be caused by nodes having mismatching data - read repair.
Or a node having been skipped for a few pages - due to partition or slowness.
Schema updates can run concurrently with the query; dealing with this would require complex code and isn't worth it, so we drop the querier instead.
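The failure handling above (create on miss, drop and recreate on position or schema mismatch) can be condensed into one sketch. The function and field names are hypothetical:

```python
# Sketch of lookup validation: a found querier is only reused if both
# its read position and its schema version match the incoming page
# request; otherwise it is dropped and a fresh querier is created.

def resume_or_recreate(cached, want_pos, want_schema, make_querier, counters):
    if cached is None:
        counters["misses"] += 1             # nothing found: create anew
        return make_querier()
    if cached["pos"] != want_pos or cached["schema"] != want_schema:
        counters["drops"] += 1              # found, but unusable
        return make_querier()
    return cached                           # hit: continue where we left off
```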
Abandoned queriers can happen for a number of reasons: the client crashed, or the node was partitioned away.
Each inserted querier has a TTL of 10s.
Bounds memory consumption. The cache is currently capped at 4% of the shard's memory.
We have read concurrency control. It is permit based: each new read, that is, each new querier, has to obtain a permit before it can start reading.
Permits are limited.
Queriers hold on to their permit for their entire lifetime.
It can happen that incoming new reads cannot be started because all permits have run out; in that case we evict cached queriers to free up permits.
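The three eviction policies can be sketched together. The 10 s TTL and the memory cap mirror the talk; the data structure and method names are illustrative only:

```python
# Sketch of the three eviction policies: time-based (entries expire
# after a TTL), memory-based (total cached memory is capped), and
# permit-based (a cached querier can be evicted to free its read
# permit for an incoming new read).

TTL = 10.0  # seconds, as in the talk

class EvictingCache:
    def __init__(self, memory_limit):
        self.memory_limit = memory_limit
        self.entries = []                # (inserted_at, size, querier), FIFO

    def insert(self, now, size, querier):
        self.evict_expired(now)
        self.entries.append((now, size, querier))
        # Memory-based: evict oldest entries until under the cap.
        while sum(e[1] for e in self.entries) > self.memory_limit:
            self.entries.pop(0)

    def evict_expired(self, now):
        # Time-based: drop every entry older than the TTL.
        self.entries = [e for e in self.entries if now - e[0] < TTL]

    def evict_one_for_permit(self):
        # Permit-based: free a permit by dropping the oldest querier.
        return self.entries.pop(0)[2] if self.entries else None
```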
Misses - number of lookups that failed
Drops - number of lookups that succeeded but the querier is not suitable for continuing the query.
Hit rate can be derived from these three metrics.
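Since every lookup is either a miss, a drop (found but unusable), or a genuine hit, the derivation is straightforward:

```python
# Hit rate derived from the counters: hits are the lookups that were
# neither misses nor drops.

def hit_rate(lookups, misses, drops):
    hits = lookups - misses - drops
    return hits / lookups if lookups else 0.0
```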
Mostly focused on benchmarking scanning large partitions, read from disk - the use case that suffered the most from stateless paging.
Normalized graph.
Focusing on the improvement itself, instead of the actual numbers.
Explain BEFORE and AFTER.
Amazing almost 2.5X improvement in throughput.
Also normalized graph, showing only the improvements.
Improvement in throughput is not as impressive as that of single partition scans.
Partition range scans are a lot more complicated, higher CPU cost.
Disk is a smaller factor in their performance.
We observed the bottleneck moving from the disk to the CPU.
We can see that the improvement in disk usage is much more significant:
the disk is accessed a lot less, and we read fewer bytes per CQL read.
Stateful paging achieved:
Better (less) resource utilization.
Improved performance (throughput).
Vastly improved handling of large partitions, a pain point of Scylla’s in the past.
We published two blog posts on this topic, with a lot more detail on how all this is implemented.
If you are interested in more details, I recommend reading them.
Even if you are not interested in more details on paging, I still recommend visiting our blog, as it has a lot of other interesting posts.