2. About me
• Cassandra operator at Yahoo! Japan Corp.
• https://issues.apache.org/jira/browse/CASSANDRA-5977
3. Remark
• This is a summary of the following tickets:
– https://issues.apache.org/jira/browse/CASSANDRA-11206
– https://issues.apache.org/jira/browse/CASSANDRA-9738
5. High level: read path
[Diagram: Row Cache, Key Cache, SSTables, MemTable]
1. Check row cache before going to key cache
2. Check the key cache to get the offsets to data
3. Find the offsets to data and retrieve data
4. Merge data from sstables and memtable
5. Populate row cache with new row returned
http://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlAboutReads.html
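The five steps above can be sketched as a toy model. This is only an illustration under simplifying assumptions: the class name `ToyNode`, the dict-based caches, and the per-column "newest timestamp wins" merge are hypothetical stand-ins, not Cassandra's actual classes or merge logic.

```python
class ToyNode:
    """Toy model of the five read-path steps; not Cassandra's real classes."""

    def __init__(self, sstables, memtable):
        # Each sstable: {"index": {key: offset}, "data": {offset: row}}
        self.sstables = sstables
        # memtable: {key: row}, where a row is {column: (timestamp, value)}
        self.memtable = memtable
        self.row_cache = {}   # key -> merged row
        self.key_cache = {}   # (sstable number, key) -> offset

    def read(self, key):
        # 1. Check the row cache before going to the key cache.
        if key in self.row_cache:
            return self.row_cache[key]
        fragments = []
        for no, sstable in enumerate(self.sstables):
            # 2. A key cache hit gives the offset directly.
            offset = self.key_cache.get((no, key))
            if offset is None:
                # 3. Otherwise find the offset in the index and remember it.
                offset = sstable["index"].get(key)
                if offset is None:
                    continue
                self.key_cache[(no, key)] = offset
            fragments.append(sstable["data"][offset])
        # 4. Merge data from sstables and memtable (newest timestamp wins).
        if key in self.memtable:
            fragments.append(self.memtable[key])
        merged = {}
        for fragment in fragments:
            for column, (ts, value) in fragment.items():
                if column not in merged or ts > merged[column][0]:
                    merged[column] = (ts, value)
        # 5. Populate the row cache with the row being returned.
        if merged:
            self.row_cache[key] = merged
        return merged or None
```

A second `read` of the same key returns straight from `row_cache`, which is Pattern 1 on the next slide.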
6. Pattern 1. The row is in row cache
[Diagram: read path components laid out across heap, off heap, and disk: MemTable, Row Cache, Key Cache, Bloom Filter, Partition Summary, Compression Offsets, Partition Index, Data]
1. read request
2. Return the row if it is in the row cache
7. Pattern 2. The key is in key cache
[Diagram: read path components laid out across heap, off heap, and disk: MemTable, Row Cache, Key Cache, Bloom Filter, Partition Summary, Compression Offsets, Partition Index, Data]
1. read request
2. Check bloom filters
3. Check whether the partition key is in the key cache
4. Find the offset to the result set
5. Access the result set
8. Pattern 3. The key is not cached
[Diagram: read path components laid out across heap, off heap, and disk: MemTable, Row Cache, Key Cache, Bloom Filter, Partition Summary, Compression Offsets, Partition Index, Data]
1. read request
2. Miss -> Check bloom filters
3. Check whether the partition key is in the key cache
4. Miss -> Binary search for the closest location in the index
5. Disk scan to find the offsets
6. Find the offset into the result set
7. Access the result set
8. Update key cache
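The lookup part of this pattern (steps 2 to 5 and 8) can be sketched roughly as follows. Everything here is a hypothetical stand-in: the function name `locate_partition`, a Python `set` playing the bloom filter (so it never gives false positives), a sorted list of sampled keys playing the partition summary, and a sorted list of `(key, offset)` pairs playing the on-disk partition index.

```python
import bisect


def locate_partition(key, bloom, key_cache, summary, index):
    """Return the data offset for key, or None. Toy structures only."""
    # 2. Bloom filter: a definite "no" lets us skip this sstable.
    if key not in bloom:
        return None
    # 3. A key cache hit returns the offset immediately.
    if key in key_cache:
        return key_cache[key]
    # 4. Miss: binary search the sampled summary for the closest
    #    index position at or before the key.
    sampled_keys = [k for k, _ in summary]
    i = bisect.bisect_right(sampled_keys, key) - 1
    if i < 0:
        return None
    start = summary[i][1]
    # 5. Scan the index forward from that position to find the offset.
    for entry_key, offset in index[start:]:
        if entry_key == key:
            # 8. Update the key cache so the next read is Pattern 2.
            key_cache[key] = offset
            return offset
        if entry_key > key:
            break
    return None


# Usage: the summary samples every second index entry.
index = [("a", 0), ("c", 100), ("e", 200), ("g", 300), ("i", 400)]
summary = [("a", 0), ("e", 2), ("i", 4)]
bloom = {k for k, _ in index}
cache = {}
offset = locate_partition("g", bloom, cache, summary, index)
```

Steps 6 and 7 (seeking into the data file and reading the result set) are left to the caller in this sketch.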
10. Partition Index Recap
• http://distributeddatastore.blogspot.jp/2013/08/cassandra-sstable-storage-format.html
11. RowIndexEntry
• Partition size < 64 KB
– RowIndexEntry
• Position
• Serialized size of data
• Partition size > 64 KB
– IndexedEntry
• Position
• Serialized size of data
• IndexInfo[]
– Serialization method
– Offset
– Width
– etc.
Approximation with 16-byte values (partition size : index size / objects):
1 MB : 3 KB / > 200 objects
4 MB : 11 KB / > 800 objects
64 MB : 180 KB / > 13k objects
512 MB : 1.4 MB / > 106k objects
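A quick back-of-the-envelope check of the object counts above, assuming one IndexInfo entry per 64 KB of partition data (the threshold shown on this slide). The constant and function names are made up for this sketch; the object counts are the lower bounds quoted on the slide.

```python
COLUMN_INDEX_SIZE_KB = 64  # assumed: one IndexInfo per 64 KB of partition data


def index_info_count(partition_mb):
    """Number of IndexInfo entries for a partition of the given size."""
    return partition_mb * 1024 // COLUMN_INDEX_SIZE_KB


# (partition size in MB, object count quoted on the slide)
slide_rows = [(1, 200), (4, 800), (64, 13_000), (512, 106_000)]

for partition_mb, objects in slide_rows:
    entries = index_info_count(partition_mb)
    print(f"{partition_mb:>4} MB: {entries:>5} IndexInfo entries, "
          f"~{objects / entries:.1f} objects per entry")
```

Under that assumption the slide's numbers work out to roughly 12 to 13 heap objects per IndexInfo entry, so the object count grows linearly with partition size.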
12. Pattern 3. The key is not cached
[Diagram: read path components laid out across heap, off heap, and disk: MemTable, Row Cache, Key Cache, Bloom Filter, Partition Summary, Compression Offsets, Partition Index, Data]
1. read request
2. Miss -> Check bloom filters
3. Check whether the partition key is in the key cache
4. Miss -> Binary search for the closest location in the index
5. Disk scan to find the offsets
6. Find the offsets into the result set
7. Access the result set
8. Update key cache
9. GC, GC, GC…
13. Current solution
• If the partition's serialized index size < column_index_cache_size_in_kb (configurable)
– IndexedEntry is kept on heap
• Otherwise
– Always read from disk when needed
• https://issues.apache.org/jira/browse/CASSANDRA-11206
• https://www.youtube.com/watch?v=qa84vABqftM
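The rule on this slide amounts to a single threshold check. A minimal sketch, assuming the threshold is compared against the serialized size of the partition's index entries; the function name `index_placement` and the 4 KB threshold in the usage lines are illustrative choices, not Cassandra's code or its default setting.

```python
def index_placement(index_size_bytes, column_index_cache_size_in_kb):
    """Toy version of the rule: small indexes stay on heap,
    large ones are read from disk whenever they are needed."""
    if index_size_bytes < column_index_cache_size_in_kb * 1024:
        return "heap"
    return "disk"


# Usage with an illustrative 4 KB threshold:
small = index_placement(3 * 1024, 4)     # ~3 KB index
large = index_placement(180 * 1024, 4)   # ~180 KB index
```

With this split, only small index entries add to heap pressure, while wide partitions pay an extra disk read instead of creating many long-lived objects.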
14. Other possible solutions
• IndexInfo is never kept on heap
– Read from disk when needed
– Degrades performance when a small partition is read
15. Other possible solutions
• Migrate key cache to be fully off heap
– https://issues.apache.org/jira/browse/CASSANDRA-9738
– Serialization and deserialization cost too much when a large partition is read
• Will Birch help us to solve this problem?
– https://issues.apache.org/jira/browse/CASSANDRA-9754