Fetching a large amount of data in a single query is a longstanding pain point for applications. Queries that return a significant amount of data have to be paged, in other words, split into multiple subqueries that return the data little by little. In both Scylla and Apache Cassandra, paging has been stateless: each subquery is independent of the others and can even be sent to different replicas. Because of this, the work done in previous subqueries is not reused, reducing throughput below the expected maximum. In this talk we examine the problems with the previous stateless paging implementation and introduce the new stateful paging implementation that brings vast improvements in the throughput of large partition scans.
Scylla Summit 2018: How We Made Large Partition Scans Over Two Times Faster
1. How we made large partition
scans over two times faster
Botond Denes
Software Developer @ ScyllaDB
2. Presenter bio
Botond is a software engineer who has worked in a range of
roles, from web developer to backend developer, in a range of
industries from railway automation to finance. He loves
programming and solving challenging problems with elegant
code, open-source software, Linux, and C++. What he likes best
about working here is that Scylla is made up of that entire list.
7. What is Stateless Paging?
Paging State Cookie
Create Query State
Destroy Query State
LEGEND
CLIENT SCYLLA
8. What is Wrong with Stateless Paging?
▪ Setting up the query state requires a non-trivial amount of work
▪ Relatively cheap for row-cache
▪ Expensive for read-from-disk:
• Identify sstables
• Read summary and index files
• Skip to start position in the sstable
10. Stateful Paging
CLIENT SCYLLA
Save Query State
Look-up Query State
Query Key
Paging State Cookie
Create Query State
Destroy Query State
LEGEND
11. What is the query state exactly?
Cluster
Node 0
0 1 2 3 4 5
Node 1
0 1 2 3 4 5
Node 2
0 1 2 3 4 5
12. Sticky Replicas
▪ Send all page requests to the same set of replicas
▪ Implemented by storing the list of replicas in the
Paging State Cookie
▪ The replicas are chosen on the first page and “stuck to”
for the rest of the query
13. Querier Cache - Overview
▪ Special-purpose cache
▪ Each shard of each node has one
▪ Entries are saved under the query key
▪ Multiple entries can be inserted with the same key
14. Querier Cache - Dealing with Failures
▪ Create new querier on miss
▪ Drop found querier and create a new one on:
• Read position mismatch
• Schema version mismatch
15. Querier Cache - Eviction Policies
▪ Time based
▪ Memory based
▪ Read permit based
16. Diagnostics
New counters:
▪ Lookups
▪ Misses
▪ Drops
▪ Evictions
▪ Population
New CQL trace messages:
▪ When a querier is looked up
▪ When a querier is cached
Scylla's query paging.
How we improved it.
How this benefits large partition scans and partition range scans.
Note that I might use “read” and “query” interchangeably.
Queries can return an unknown amount of data.
The exact amount is only known *after* the query has been executed.
Reading an unknown amount of data at once is dangerous: it can fill up memory and hog the CPU, and can cause denial of service.
To avoid this Scylla uses paging, that is, it reads and transmits the results in limited-size chunks, called pages.
Pages are limited by the number of rows (10K, changeable by the client) and by size (1 MB, a fixed sanity limit, since setting the row limit to a very high number could otherwise lead to denial of service).
After sending each page to the client, Scylla stops and waits for the client to explicitly request the next page.
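The request/response cycle above can be sketched from the client's point of view. This is a minimal sketch only; `fetch_page` and the in-memory "server" below are hypothetical stand-ins, not a real driver API:

```python
# Minimal sketch of client-driven paging: the client keeps requesting
# pages until the server returns no paging-state cookie.
# The "server" here pages over an in-memory list of rows.

PAGE_SIZE = 3  # stand-in for the 10K-row / 1 MB page limits

def fetch_page(rows, cookie):
    """Return one page of `rows` starting at the position in `cookie`."""
    start = cookie or 0
    page = rows[start:start + PAGE_SIZE]
    next_cookie = start + PAGE_SIZE if start + PAGE_SIZE < len(rows) else None
    return page, next_cookie

def scan(rows):
    """Client loop: explicitly request the next page until none remain."""
    results, cookie = [], None
    while True:
        page, cookie = fetch_page(rows, cookie)
        results.extend(page)
        if cookie is None:  # no more pages
            return results
```

The key point the sketch illustrates is that the server does nothing between pages: the client drives the loop by explicitly requesting each next page.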
A cookie is transmitted with each request and response.
This cookie is called the “Paging State”.
The Paging State is an opaque binary blob with arbitrary content from the point of view of the client.
Scylla can choose what to include in it, its content is not part of the protocol.
This provides a certain flexibility for the server in the implementation of paging.
In any case, Scylla stored just the read position in it: the last partition key and clustering key.
Scylla itself didn't store any state related to the query.
Scylla's internal Query State was created anew at the beginning of each page and destroyed at the end.
The Query State is an abstract concept that represents all the internal state required to serve the query.
Essentially each page is a separate query.
This has the advantage of simple code but has many drawbacks.
Scylla had to do all of this over again at the start of each page.
Scylla doesn't use the OS page cache, so all this effort is truly lost when the state is destroyed; we don't get the benefit of the OS having cached the recently read files in RAM for us.
This gets worse as the size of the scanned partition increases, so it hurts large partitions especially badly.
Make paging stateful, that is, create the Query State on the first page and use it throughout the entire query.
Sounds simple but had a lot of details to get right.
At the end of each page we save the Query State in a cache.
For this we use a unique key called the Query Key, which is generated on the first page.
This key is then remembered by being included in the Paging State Cookie.
At the beginning of all subsequent pages we look up the saved Query State and continue the query where we left off.
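The save/lookup cycle can be sketched as follows. This is an illustrative model, not Scylla's actual internals; `serve_page`, `querier_cache`, and the state dict are all hypothetical names:

```python
import uuid

# Sketch of the stateful flow: the Query State is created on the first
# page, saved in a cache at the end of each page under a unique Query
# Key, and the key travels back to the client inside the Paging State
# Cookie so subsequent pages can find the saved state again.

querier_cache = {}

def serve_page(cookie, make_state):
    if cookie is None:
        # First page: no key in the cookie yet -> create state and key.
        key, state = uuid.uuid4(), make_state()
    else:
        key = cookie
        state = querier_cache.pop(key, None)
        if state is None:            # miss: recreate from scratch
            state = make_state()
    state["pages_served"] += 1       # ... serve the page ...
    querier_cache[key] = state       # save state at end of page
    return key                       # key goes back inside the cookie
```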
The Query has a local state on each shard of each node that is involved in the query.
In this imaginary cluster of 3 nodes, with 6 shards each, an imaginary query is run on Shard1 of Node0 and Node2.
This imaginary query can be a single partition scan, executed with a CL=2.
So the query state will be made up from the local state on Shard1 of Node0 and the local state on Shard1 of Node2.
We call the local state the Querier, the querier is an actual object that encapsulates all state and logic required to serve the query on a single shard of a single node.
So when we are talking about saving the query state, we mean saving the actual querier objects.
All this state is located on replicas, no state is stored on the coordinator.
The coordinator is the node that receives the request. Its job is to select the replicas to forward the request to, merge the results, and send them to the client.
The replica is the node that actually has the data, and that actually executes the query.
The coordinator can be a replica as well, in fact drivers will choose replicas such that this is true.
Since state is local to replicas, we have to use the same set of replicas through the query.
This has a side effect: the driver can choose a coordinator that is not one of the replicas previously used for this query, which introduces an extra network hop.
Drivers choose a new coordinator for each page for load balancing.
This can be fixed by changing the driver to stick to the same coordinator for the entire query.
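The sticky-replicas mechanism from the slides (store the replica list in the Paging State Cookie on the first page, reuse it afterwards) can be sketched like this. The function name and the selection policy are stand-ins:

```python
# Sketch of sticky replicas: on the first page the coordinator picks
# replicas and records them in the paging-state cookie; all later
# pages reuse that list instead of choosing replicas again.

def pick_replicas(cookie, live_replicas, rf):
    if cookie.get("replicas"):
        # Stick to the replicas chosen on the first page.
        return cookie["replicas"]
    chosen = live_replicas[:rf]      # stand-in for the real selection policy
    cookie["replicas"] = chosen      # remember them in the cookie
    return chosen
```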
Piotr Jastrzebski talks about this in detail in his talk about driver optimizations.
It is the foundation upon which stateful paging is built.
When multiple entries have the same key, we distinguish them by their read range - the partition range they are reading.
In the case of single partition scans this will be just a single partition.
This is possible for IN queries, if two listed partitions are located on the same shard of the same node.
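A multimap-style cache like the one described (several entries under one Query Key, disambiguated by read range) could look like this sketch. The class and its methods are illustrative, not Scylla's actual data structure:

```python
from collections import defaultdict

# Sketch of the querier cache as a multimap: several entries may share
# one Query Key (e.g. an IN query hitting two partitions on the same
# shard of the same node); a lookup picks the entry whose read range
# matches the incoming page request.

class QuerierCache:
    def __init__(self):
        self._entries = defaultdict(list)   # key -> [(read_range, querier)]

    def insert(self, key, read_range, querier):
        self._entries[key].append((read_range, querier))

    def lookup(self, key, read_range):
        for i, (rng, querier) in enumerate(self._entries[key]):
            if rng == read_range:
                del self._entries[key][i]   # an entry is consumed on a hit
                return querier
        return None                         # miss
```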
In a perfect world each lookup for a saved querier succeeds and querier can be used to continue the query.
We don’t live in a perfect world - a lot can go wrong in a distributed database.
A previously used replica can crash or become partitioned away; the query has to move to a new one, and the lookup will miss.
It is possible for the lookup to succeed but the querier to be not suitable for continuing the query.
It is possible that the page request will want to continue from a position that doesn’t match the cached querier’s.
The position of a querier is the position it stopped reading on the previous page and consequently the position it will continue on the next page.
This position has row granularity.
This can be caused by nodes having mismatching data - read repair.
Or a node having been skipped for a few pages - due to partition or slowness.
Schema updates can run concurrently with the query; dealing with this would require complex code and isn't worth it, so we drop the querier instead.
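The failure handling above (create on miss, drop and recreate on position or schema mismatch) can be condensed into one sketch. The function and field names are hypothetical:

```python
# Sketch of lookup validation: a found querier is only reused if both
# its read position and its schema version match the incoming page
# request; otherwise it is dropped and a fresh querier is created.

def resume_or_recreate(cached, want_pos, want_schema, make_querier, counters):
    if cached is None:
        counters["misses"] += 1             # nothing found: create anew
        return make_querier()
    if cached["pos"] != want_pos or cached["schema"] != want_schema:
        counters["drops"] += 1              # found, but unusable
        return make_querier()
    return cached                           # hit: continue where we left off
```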
Abandoned queriers can happen for a number of reasons: the client crashed, or the node was partitioned away.
Each inserted querier has a TTL of 10s.
Bounds memory consumption. The cache is currently capped at 4% of the shard's memory.
We have read concurrency control. It is permit based: each new read, that is, each new querier, has to obtain a permit before it can start reading.
Permits are limited.
Queriers hold on to their permit for their entire lifetime.
It can happen that incoming new reads cannot be started because all permits have run out; in that case we evict cached queriers to free up permits.
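The three eviction policies can be sketched together. The 10 s TTL and the memory cap mirror the talk; the data structure and method names are illustrative only:

```python
# Sketch of the three eviction policies: time-based (entries expire
# after a TTL), memory-based (total cached memory is capped), and
# permit-based (a cached querier can be evicted to free its read
# permit for an incoming new read).

TTL = 10.0  # seconds, as in the talk

class EvictingCache:
    def __init__(self, memory_limit):
        self.memory_limit = memory_limit
        self.entries = []                # (inserted_at, size, querier), FIFO

    def insert(self, now, size, querier):
        self.evict_expired(now)
        self.entries.append((now, size, querier))
        # Memory-based: evict oldest entries until under the cap.
        while sum(e[1] for e in self.entries) > self.memory_limit:
            self.entries.pop(0)

    def evict_expired(self, now):
        # Time-based: drop every entry older than the TTL.
        self.entries = [e for e in self.entries if now - e[0] < TTL]

    def evict_one_for_permit(self):
        # Permit-based: free a permit by dropping the oldest querier.
        return self.entries.pop(0)[2] if self.entries else None
```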
Misses - number of lookups that failed
Drops - number of lookups that succeeded but the querier is not suitable for continuing the query.
Hit rate can be derived from these three metrics.
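Since every lookup is either a miss, a drop (found but unusable), or a genuine hit, the derivation is straightforward:

```python
# Hit rate derived from the counters: hits are the lookups that were
# neither misses nor drops.

def hit_rate(lookups, misses, drops):
    hits = lookups - misses - drops
    return hits / lookups if lookups else 0.0
```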
Mostly focused on benchmarking scanning large partitions, read from disk - the use case that suffered the most from stateless paging.
Normalized graph.
Focusing on the improvement itself, instead of the actual numbers.
Explain BEFORE and AFTER.
Amazing almost 2.5X improvement in throughput.
Also normalized graph, showing only the improvements.
Improvement in throughput is not as impressive as that of single partition scans.
Partition range scans are a lot more complicated, higher CPU cost.
Disk is a smaller factor in their performance.
We observed the bottleneck moving from the disk to the CPU.
We can see that the improvement in disk usage is much more significant:
the disk is accessed a lot less, and we read fewer bytes per CQL read.
Stateful paging achieved:
Better (less) resource utilization.
Improved performance (throughput).
Vastly improved handling of large partitions, a pain point of Scylla’s in the past.
We published two blog posts on this topic, with a lot more detail on how all this is implemented.
If you are interested in more details, I recommend reading them.
Even if you are not interested in more details on paging, I still recommend visiting our blog, as it has a lot of other interesting posts.