5. Data Model
• Row-oriented
• Number of columns/Names can differ
• Example rows (row key: columns)
– xyz: name = Paul, zip = 95123
– abc: name = Adam, zip = 94538, sex = Male
– nk12: name = Nitish
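A minimal sketch of the row-oriented model above, using plain Python dictionaries as a stand-in for a column family. This is not Cassandra client code; the `rows` structure and the `get_column` and `column_names` helpers are illustrative assumptions. Each row key maps to its own set of columns, so rows need not share column names.

```python
# Illustrative stand-in for a column family: row key -> {column name: value}.
rows = {
    "xyz": {"name": "Paul", "zip": "95123"},
    "abc": {"name": "Adam", "zip": "94538", "sex": "Male"},
    "nk12": {"name": "Nitish"},
}


def get_column(row_key, column_name, default=None):
    """Point read: one column of one row; a column the row lacks is simply absent."""
    return rows.get(row_key, {}).get(column_name, default)


def column_names(row_key):
    """Return the row's column names in sorted order."""
    return sorted(rows.get(row_key, {}))
```

Row `nk12` stores only `name`; nothing is written for the columns it does not have, which is the point of the schema-free layout.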
6. Read/Write performance
• Write performance : Superfast!!
– Sequential I/O
– In-memory write
– Zero locking
• Point reads: highly performant
• Range scans
– Need reverse-key indexes
– Assess the need for range scans (full-table scans)
– Use Netflix Astyanax client library
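One reason to assess the need for range scans: with a hash-based partitioner, rows are placed by a hash of the row key, so neighbouring keys are generally not stored next to each other. The sketch below assumes MD5 hashing in the spirit of Cassandra's RandomPartitioner; the `token` helper and the sample keys are hypothetical, and the code does not reproduce Cassandra's exact token arithmetic.

```python
import hashlib


def token(row_key: str) -> int:
    """MD5-based token for a row key (illustrative, not Cassandra's exact math)."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")


def rows_in_token_order(row_keys):
    """Order in which a hash-partitioned store would walk these rows."""
    return sorted(row_keys, key=token)


keys = ["user0001", "user0002", "user0003", "user0004"]  # hypothetical keys
print("key order:  ", sorted(keys))
print("token order:", rows_in_token_order(keys))
```

Comparing the two printed lists shows whether key order survives hashing; when it does not, a scan over a key range cannot be served as one sequential read.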
7. Wide-row implementation
• Viewing history
[Diagram: one wide row per row key (100, 501, 1000). Column names are dates (22-Jan, 24-Jan, 25-Jan, 26-Jan, 28-Jan, 29-Jan, 1-Mar) and column values are JSON data; row 1000 also carries a name column (Nitish).]
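The wide-row layout for viewing history can be sketched as one row per key with date-named columns kept in sorted order, so a date range becomes a slice of one row. This is an in-memory illustration, not Cassandra or Astyanax code; the `WideRow` class, the year 2013, and the JSON payloads are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right, insort
from datetime import date


class WideRow:
    """One row whose column names are dates, kept sorted (illustrative only)."""

    def __init__(self):
        self._names = []   # column names in sorted order
        self._values = {}  # column name -> JSON string

    def insert(self, day: date, json_blob: str) -> None:
        # Add the name once; re-inserting a day overwrites its value.
        if day not in self._values:
            insort(self._names, day)
        self._values[day] = json_blob

    def slice(self, start: date, end: date):
        """Return (date, value) pairs with start <= date <= end, in date order."""
        lo = bisect_left(self._names, start)
        hi = bisect_right(self._names, end)
        return [(d, self._values[d]) for d in self._names[lo:hi]]


# Row key 501 from the diagram; payloads are made up for the example.
history = {501: WideRow()}
history[501].insert(date(2013, 1, 25), '{"title": "B"}')
history[501].insert(date(2013, 1, 24), '{"title": "A"}')
history[501].insert(date(2013, 1, 26), '{"title": "C"}')

recent = history[501].slice(date(2013, 1, 25), date(2013, 1, 31))
```

Because the columns stay sorted, reading a window of history touches one row and a contiguous run of columns rather than many rows.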
8. Think Data Archival
• Data stores in Netflix grow exponentially
• Have a process in place to archive data
– Work with Data Science Engineering /DW
– Move data to cheap H/W
– Set the right expectations about latency when reading historical data
• Cassandra TTLs
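As a rough illustration of how a per-column TTL behaves, the sketch below stores an expiry time with each column and hides the column once that time has passed. It models the idea only; the `TTLColumnStore` class and its injectable `clock` are hypothetical and are not part of any Cassandra client API.

```python
import time


class TTLColumnStore:
    """Toy column store in which each column may carry a time-to-live."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._cols = {}      # (row_key, column) -> (value, expires_at or None)

    def put(self, row_key, column, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._cols[(row_key, column)] = (value, expires_at)

    def get(self, row_key, column):
        entry = self._cols.get((row_key, column))
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._cols[(row_key, column)]  # lazily purge the expired column
            return None
        return value
```

Data written with a TTL ages out without a separate cleanup job, which is why TTLs sit naturally next to an archival process.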
10. Observations
• Cassandra scales linearly without any noticeable
degradation to running cluster
• Read performance is good enough to remove
memcache in some cases
• Self-healing : minimal operational noise
• Developers
– Mindset needed to shift from normalization to
denormalization
– Need a reasonable understanding of Cassandra
architecture
Share some practical data modeling lessons we have learned over the past 2 years.
Understand your data usage patterns and match them to your persistence store, at the cost of denormalization.
It is very important to spend time coming up with an appropriate data model; the cost is high. Subscriber example.
Start with a live example, then use it as a segue to cover some best practices.
Rows are indexed.
Columns are sorted based on the comparator you specify, so use it to your benefit.
Keep column names short, as they are stored with every column.
Column size = 15 bytes + size of name + size of value.
Don't store empty columns if there is no need: schema-free design.
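The column size formula in the notes above (15 bytes + size of name + size of value) can be turned into a quick estimate of what a shorter column name saves. The helper names, the sample column names, and the one-million-column count are assumptions for illustration; only the formula comes from the notes.

```python
COLUMN_OVERHEAD_BYTES = 15  # fixed per-column overhead quoted in the notes


def column_size(name: str, value: str) -> int:
    """Column size = 15 bytes + size of name + size of value."""
    return COLUMN_OVERHEAD_BYTES + len(name.encode("utf-8")) + len(value.encode("utf-8"))


def total_size(name: str, value: str, column_count: int) -> int:
    """Estimated bytes for column_count columns sharing this name and value size."""
    return column_size(name, value) * column_count


short = column_size("zip", "95123")       # 15 + 3 + 5 = 23 bytes
long_ = column_size("zip_code", "95123")  # 15 + 8 + 5 = 28 bytes

# Bytes saved across one million columns by using the shorter name.
saved = total_size("zip_code", "95123", 1_000_000) - total_size("zip", "95123", 1_000_000)
```

Five extra bytes per name looks trivial on one column, but across a million columns it adds up to 5,000,000 bytes before replication.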
Cassandra is for point queries. Range scans are still OK for a small set of rows.
We don't have linear growth. TTL is a fascinating feature, coming from an Oracle background.
Understanding Cassandra architecture is needed to reap the benefits of distributed computing and high performance.