This document discusses Cassandra time series data challenges and Threat Stack's solutions. Typical problems include disks filling quickly and data losing value over time. Threat Stack handles 5-10TB of data daily with 80,000-150,000 transactions per second. They developed their own MTCS compaction strategy and sstablejanitor tool to better handle time series data expiration in Cassandra by unlinking expired SSTables. This allows them to analyze large volumes of real-time security event data cost effectively at scale.
Terror & Hysteria: Cost Effective Scaling of Time Series Data with Cassandra (Sam Bisbee, Threat Stack) | C* Summit 2016
1. Terror & Hysteria: Cost Effective
Scaling of Time Series Data
with Cassandra
Sam Bisbee, Threat Stack CTO
2. Typical [time series] problems on C*
● Disk utilization creates a scaling pattern of lighting money on fire
  ○ Only works for a month or two, even with 90% disk utilization
● Every write-up we found focused on schema design for tracking integers across time
  ○ There are days we wish we only tracked integers
● Data drastically loses value over time, but C*'s design doesn't acknowledge this
  ○ TTLs only address zero-value states, not partial value
  ○ Ex., 99% of reads are for data in its first day
  ○ Not all sensors are equal
3. Categories of Time Series Data
[Chart: quadrants plotted by volume of Tx's vs. size of Tx's, placing CRUD/Web 2.0, system monitoring (CPU, etc.), traditional object stores, and Threat Stack]
4. Categories of Time Series Data
[Same chart, annotated to contrast the "traditional time series on C*" that everyone writes about (system monitoring) with where Threat Stack sits]
"We're going to need a bigger boat. Or disks."
5. We care about this thing called margins
(see: we're in Boston, not the Valley)
6. Data at Threat Stack
● 5 to 10TB per day of raw data
  ○ Crossed several TB per day in the first few months of production with ~4 people
● 80,000 to 150,000 Tx per second, analyzed in real time
  ○ Internal goal of analyzing, persisting, and firing alerts in <1s
● 90% write to 10% read Tx
  ○ Pre-compute query results for 70% of UI queries
  ○ Optimized lookup tables & complex data structures, not just "query & cache"
● 100% AWS, distrust of remote storage in our DNA
  ○ This is not just EBS bashing. This applies to all databases on all platforms, even a cage in a data center.
● By the way, we're on DSE 4.8.4 (C* 2.1)
7. Generic data model
● Entire platform assumes that events form a partially ordered, eventually consistent, write-ahead log
  ○ A wonderful C* use case, so long as you only INSERT
● UPDATE is a dirty word and C* counters are "banned"
  ○ We do our big counts elsewhere ("right tool for the right job")
● No DELETEs: too many key permutations and we don't want tombstones
● Duplicate writes will happen
  ○ Legitimate: fully or partially failed batches of writes
  ○ Legitimate: a sensor resends data because it doesn't see the platform's acknowledgement of the data
  ○ How-do-you-even-computer: people cannot configure NTP, so have fun constantly receiving data from 1970
● TTL on insert time, store and query on event time (see the sketch below)
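To make that last point concrete, a minimal sketch with the Python driver, assuming a hypothetical keyspace, table, and retention window (not Threat Stack's actual schema): the row's keys come from the event's own timestamp, while the TTL clock starts when the row is inserted.

```python
# Hypothetical keyspace/table/TTL; insert-only, no UPDATE, DELETE, or counters.
from cassandra.cluster import Cluster

TTL_SECONDS = 30 * 24 * 3600  # assumed retention window

session = Cluster(["127.0.0.1"]).connect("telemetry")

# TTL is applied at insert time; sensor_id/event_day/event_id come from the
# event itself, so data is stored and queried on event time.
insert = session.prepare(
    "INSERT INTO events (sensor_id, event_day, event_id, event_ts, payload) "
    "VALUES (?, ?, ?, ?, ?) USING TTL " + str(TTL_SECONDS)
)

def write_event(sensor_id, event_day, event_id, event_ts, payload):
    # Duplicates are tolerated: re-inserting the same primary key simply
    # overwrites the identical row instead of requiring an UPDATE.
    session.execute(insert, (sensor_id, event_day, event_id, event_ts, payload))
```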
8. We need to show individual events or slices, so we cannot use time granularity rows (1min, 15min, 30min, 1hr, etc.)
9. Creating and updating tables' schema
● ALTER TABLE isn't fun, so we support dual writes instead (see the sketch after this list)
  ○ Create the new schema, performing dual reads across new & old
  ○ Cut writes over to the new schema
  ○ After TTL time, DROP TABLE old
  ○ Each step is verifiable with unit tests and metrics
● Maintains the insert-only data model for a temporary disk utilization cost
● Allows trivial testing of analysis and A/B'ing of schemas
  ○ Just toss a new schema in, gather some insights, and then feel free to drop it
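A minimal sketch of that cutover, assuming hypothetical table names (events_v1/events_v2) and the Python driver, not Threat Stack's actual migration tooling: reads merge both generations while writes follow a single flag, and once the old table's rows have all passed their TTL it can simply be dropped.

```python
# Hypothetical table names and flag; only the shape of the cutover is shown.
WRITE_TO_NEW = False  # flip once dual reads look correct in tests/metrics

def read_events(session, sensor_id, event_day):
    # Dual read: consult both the old and new schema and merge the rows.
    query = "SELECT * FROM {} WHERE sensor_id = %s AND event_day = %s"
    rows = list(session.execute(query.format("events_v1"), (sensor_id, event_day)))
    rows += list(session.execute(query.format("events_v2"), (sensor_id, event_day)))
    return rows

def write_event(session, insert_v1, insert_v2, params):
    # Writes cut over with a single switch; after one full TTL the old table
    # holds nothing live and can be DROPped.
    session.execute(insert_v2 if WRITE_TO_NEW else insert_v1, params)
```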
10. AWS Instance Types & EBS
● EBS is generally banned on our platform
  ○ Too many of us lived through the great outage
  ○ Too many of us cannot live with unpredictable I/O patterns
  ○ Biggest reason: you cannot RI EBS
● Originally used i2.2xlarges in 2014/2015
  ○ Considering the amount of "learning" we did, we were very grateful for SSDs due to the amount of streaming we had to do
● Moved to d2.xlarges and d2.2xlarges in 2015
  ○ RAID 0 the spindles with xfs
  ○ We like the CPU and RAM to disk ratio, especially since compaction stops after a few hours
11. $/TB on AWS
Prepay option     i2.2xlarge                        d2.2xlarge                      c3.2xlarge + 6 x 2TB io1 EBS
No Prepay         $619.04 / 1.6TB = $386/TB/month   $586.92 / 12TB = $49.91/TB/month   $1,713.16 / 12TB = $142.77/TB/month
Partial Prepay    $530.37 / 1.6TB = $331.48/TB/month  $502.12 / 12TB = $41.85/TB/month   $1,684.59 / 12TB = $140.39/TB/month
Full Prepay       $519.17 / 1.6TB = $324.85/TB/month  $492 / 12TB = $41/TB/month         $1,680.84 / 12TB = $140.07/TB/month
● Amortizes the one-time RI across 1yr, focusing on cost instead of cash out of pocket (see the sketch below)
● Does not account for N=3 in the cluster, so x3 for each record, then x2 for worst-case compaction headroom (realistically you need MUCH LESS)
● The c3 column assumes the d2 comparison on disk size, which is not fair versus the i2
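As a worked illustration of the amortization note above (the numbers below are made up, not the figures from the table): fold any one-time RI payment into an effective monthly cost, then divide by the instance's usable local storage.

```python
# Hypothetical inputs; this only illustrates the $/TB/month arithmetic.
def dollars_per_tb_month(monthly_recurring, one_time_upfront, usable_tb):
    # Amortize the one-time RI payment across 12 months, then normalize by disk.
    effective_monthly = monthly_recurring + one_time_upfront / 12.0
    return effective_monthly / usable_tb

# e.g. a box with 12TB of local disk, $400/month recurring, $1,200 paid upfront:
print(round(dollars_per_tb_month(400.0, 1200.0, 12.0), 2))  # -> 41.67
```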
12. We only store some raw data in C*
● Deleting data proved too difficult in the early days, even with DTCS (slides coming on how we solved this)
  ○ Re-streaming due to regular maintenance could take a week or more
● Dropping instance size doesn't solve the throughput problem since all resources are cut, not just disk size
  ○ Another reason not to use EBS, since you'll "never" get close to 100% disk utilization
● Due to the aforementioned C* durability design, the cost of data for days 2..N is too high even if you drop the replica count
13. Tying C* to raw data
● Every query must constrain a minimum of:
  ○ Sensor ID
  ○ Event Day
● Every query result must include a minimum of:
  ○ Sensor ID
  ○ Event Day
  ○ Event ID
● Batches of (sensor_id, event_day, event_id) triples are then used to look up the raw events from raw data storage (see the sketch below)
  ○ This isn't always necessary (aggregates, correlations, etc.)
● Even with the additional hops, full reads are still <1s
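A minimal sketch of that two-hop read path, assuming a hypothetical events table and a raw_store object with a bulk-get method (neither is Threat Stack's actual API):

```python
# Hypothetical schema and raw-store interface; only the shape of the flow matters.
def lookup_triples(session, sensor_id, event_day):
    # Every C* query constrains at least sensor_id and event_day, and every
    # result carries the (sensor_id, event_day, event_id) triple.
    rows = session.execute(
        "SELECT sensor_id, event_day, event_id FROM events "
        "WHERE sensor_id = %s AND event_day = %s",
        (sensor_id, event_day))
    return [(r.sensor_id, r.event_day, r.event_id) for r in rows]

def fetch_raw_events(raw_store, triples):
    # Second hop: pull the full event bodies from the raw data store in bulk.
    return raw_store.multi_get(triples)
```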
14. Using triples to batch writes
● Partition key starts with sensor id and event day
  ○ Bonus: you get a fresh ring location every day! Helps for averaging out your schema mistakes over the TTL
● Event batches off of RabbitMQ are already constrained to a single sensor id and event day
  ○ Allows mapping a single AMQP read to a single C* write (RabbitMQ is podded, not clustered)
  ○ Flow state of the pipeline becomes trivial to understand
● Batch C* writes on partition key, then on data size (soft cap at 5120 bytes, C*'s internal batch size warning threshold); see the sketch below
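A minimal sketch of that size-capped batching, assuming hypothetical encode/flush callables (e.g. flush could execute a single-partition BatchStatement):

```python
# All events from one AMQP message share a sensor_id/event_day, i.e. one
# partition, so they can be grouped into single-partition batches that stay
# under C*'s 5120-byte batch size warning threshold.
SOFT_CAP_BYTES = 5120

def batch_writes(events, encode, flush):
    batch, batch_bytes = [], 0
    for event in events:
        size = len(encode(event))  # serialized size of this event's write
        if batch and batch_bytes + size > SOFT_CAP_BYTES:
            flush(batch)           # e.g. execute one single-partition batch
            batch, batch_bytes = [], 0
        batch.append(event)
        batch_bytes += size
    if batch:
        flush(batch)
```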
15. Compaction woes, STCS & DTCS
● Used STCS in 2014/2015; expired data would get stuck
  ○ "We could rotate tables" → eh, no
  ○ "We could rotate clusters" → oh c'mon, hell no
  ○ "We could generate every historic permutation of keys within that time bucket with Spark and run DELETEs" → ...............
● Used DTCS in 2015, but expired data still got stuck
  ○ When deciding whether an SSTable is too old to compact, it compares "now" against the max timestamp (most recent write)
  ○ If you write constantly (time series), then SSTables will rarely or never stop compacting
  ○ This means you never realize the true value of DTCS for time series: the ability to unlink whole SSTables from disk
16. Cluster disk states assuming constant sensor count
[Chart: disk utilization vs. time, showing the initial build-up to the retention period and comparing "what you want" against "what you get"]
18. MTCS settings
● Never run repairs (they never worked on STCS or DTCS anyway) and hinted handoff is off (a great way to kill a cluster anyway)
● max_sstable_age_days = 1, base_time_seconds = 1 hour (see the sketch below)
  ○ Results in roughly hour-bucketed sequential SSTables
  ○ Reads are happy due to day or hour resolution, which they have to provide in the partition key anyway
● The rest of the DTCS sub-properties are defaults
● Not worried about really old and small SSTables since those are simply unlinked "soon"
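For illustration only, roughly how those options would look as CQL compaction settings. MTCS is Threat Stack's in-house strategy, so the class name below is a placeholder and the option names are assumed to mirror DTCS's; only the two values are from the slide.

```python
# Placeholder class and table names; base_time_seconds = 1 hour = 3600.
ALTER_COMPACTION = """
ALTER TABLE telemetry.events
WITH compaction = {
    'class': 'MTCS',
    'max_sstable_age_days': '1',
    'base_time_seconds': '3600'
}
"""
```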
19. MTCS + sstablejanitor.sh
● Even with MTCS, SSTables were still not getting unlinked
● So enters sstablejanitor.sh (sketched below)
  ○ A cron job fires it once per hour
  ○ It iterates over each SSTable on disk for MTCS tables (chef/cron feeds it a list of tables and their TTLs)
  ○ Uses sstablemetadata to determine the max timestamp
  ○ If past TTL, uses JMX to invoke CompactionManager's forceUserDefinedCompaction on the table
● Hack? Yes, cron + sed + awk + JMX qualifies as a hack, but it works like a charm and we don't carry expired data
● Bonus: you don't need to reserve half your disks for compaction
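A rough sketch of the janitor's decision loop. The real sstablejanitor.sh is cron + sed + awk + JMX, so this Python rendering, its helper names, and the JMX stub are assumptions about the shape of the logic, not a copy of the script.

```python
# Sketch only: parse sstablemetadata output for the max timestamp and, once a
# table's TTL has fully elapsed, hand the SSTable to CompactionManager so the
# expired data can be compacted away and the file unlinked.
import re
import subprocess
import time

def max_timestamp_seconds(sstable_path):
    # sstablemetadata prints a "Maximum timestamp" line in microseconds.
    out = subprocess.check_output(["sstablemetadata", sstable_path], text=True)
    return int(re.search(r"Maximum timestamp:\s*(\d+)", out).group(1)) / 1e6

def expired(sstable_path, ttl_seconds):
    return time.time() - max_timestamp_seconds(sstable_path) > ttl_seconds

def force_user_defined_compaction(sstable_path):
    # Stub: the real script invokes forceUserDefinedCompaction on the
    # org.apache.cassandra.db:type=CompactionManager MBean via a JMX client.
    print("would force compaction of", sstable_path)

def janitor(sstables_with_ttls):
    # sstables_with_ttls: iterable of (path/to/*-Data.db, ttl_seconds) pairs,
    # e.g. assembled from the chef/cron-provided table list.
    for path, ttl in sstables_with_ttls:
        if expired(path, ttl):
            force_user_defined_compaction(path)
```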