Time
Believing Cassandra
Timeli.io’s Big-Data journey to enlightenment under the C* Paradigm
Keith Nordstrom, PhD.
CTO and Co-Founder, Timeli.io
Company Overview
Company
§  Founded in 2013
§  Based in Boulder, CO & Sunnyvale, CA
Product/Business
§  Predictive asset analytics solutions
§  Operational applications for connected equipment
Technology Platform
§  Time series data and analytics platform
§  Proprietary time series data processing layer
§  Leverages “best of breed” open source software
Industry Verticals
§  Oil & Gas
§  Manufacturing
§  Utilities – Electric, Gas & Water
Who are we to talk?
²  Time Series data ingestion engine, platform, predictive analytics
²  Validation, Estimation, Regularization
²  Aggregations (i.e., coarse graining)
²  Based on Utilities software started in Europe in 2009
²  Added Cassandra to stack in 2011
Timeli.io
I started in late 2013 and discovered quickly something they had missed:
Cassandra can be hard to do right
Timeli Architecture
But first … a minor cultural digression
Cassandra:
²  Sister-in-law to Helen of Troy (daughter of King Priam)
²  More beautiful, more sought after, wiser
²  Sought after even by the gods themselves
²  Promised a wild night to Apollo for the power of prophecy
²  Reneged
²  Apollo left her with prophecy, but made it so nobody believed her
Moral: Cassandra accurately predicted the Fall of Troy.
Just like Cassandra of legend …
… real-life Cassandra difficult to “believe”
²  Selects designed beforehand
²  Denormalization
²  Many arcane configuration options
²  Hard to find expertise
²  Based on “tables” but not tabular
²  CQL looks like SQL. It’s not SQL.
“No indexed columns present in by-columns clause with Equal operator”
“ORDER BY is only supported when the partition key is restricted by an EQ or an IN”
“PRIMARY KEY column ‘timestamp’ cannot be restricted”
“Cannot execute this query as it might involve data filtering and thus may have unpredictable performance.”
What did this mean for Timeli?
Example: Timeli ingests data, writes to raw, writes to processed, then coarse grains 1 or more series into “aggregations.”
Multiple very competent RDBMS/Java/JPA architects built a time series app where the following could not be done:
SELECT * FROM aggregations WHERE meter_id = 4bbedd76-4e9e-11e5-885d-feff819cdc9f AND timestamp > '2013-01-01' AND timestamp < '2013-03-01';
Early Warning
Aggregations, the primary product:
“It’s a security feature! You have to know when your data exists to get your data!”
Cassandra isn’t crazy
New Beginnings
Out of all of this, Timeli was born
What did we change?
1.  Partitioner
2.  Primary Keys and Row Keys
3.  Performance/Missing data in Collection types
4.  Batching for “Performance”
5.  Double Precision vs. BigDecimal
6.  QueryBuilder vs Prepared Statements
7.  Row Limits
1. The Partitioner
What is a partitioner in Cassandra?
[Diagram: data keys (S…, T…, B…, S…, Z…, S…) placed around the Cassandra ring]
Three Types:
²  Byte Ordered Partitioner
²  Random Partitioner
²  Murmur3 Partitioner
Murmur3 is a random partitioner as well but faster
²  Our partition keys were of form {UUID}|{string key}
²  UUID 1s are uniformly distributed but keys are not
²  ByteOrderedPartitioner left big gaps:
> nodetool status ts
Datacenter: us-central1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.82.79.110 4.27 GB 256 44.9% 29d0a723-fc1f-4f73-a864-97dc6df045f5 b
UN 10.105.185.1 2.51 GB 256 26.4% 1d236bd9-5fb1-4423-83bc-168bac924db4 b
UN 10.234.92.2 2.73 GB 256 28.7% 29e1358a-bef2-495e-80bc-3de4c4499790 b
1. The Partitioner
Moral: Read the manual. Odds are you won’t think of consequences on your own.
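The skew in the nodetool output can be imitated with a toy model. The sketch below is not Cassandra's token logic: it assumes a hypothetical 3-node ring, hypothetical keys that share a common prefix, and MD5 as a stand-in for a hashing partitioner. It only illustrates why order-preserving placement piles similar keys onto one node while hashed placement spreads them.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Toy model of key placement on a 3-node ring; not Cassandra's real token logic. */
class PartitionerSkewSketch {
    static final int NODES = 3;

    /** Order-preserving placement: the node is chosen from the key's first byte. */
    static int byteOrderedNode(String key) {
        int first = key.getBytes(StandardCharsets.UTF_8)[0] & 0xFF;
        return first * NODES / 256;
    }

    /** Hash-based placement: the node is chosen from an MD5 digest of the whole key. */
    static int hashedNode(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return (digest[0] & 0xFF) * NODES / 256;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Counts how many of n hypothetical keys land on each node. */
    static int[] distribute(int n, boolean hashed) {
        int[] counts = new int[NODES];
        for (int i = 0; i < n; i++) {
            String key = "meter-" + i + "|channel"; // hypothetical key shape
            counts[hashed ? hashedNode(key) : byteOrderedNode(key)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("byte ordered: " + Arrays.toString(distribute(3000, false)));
        System.out.println("hashed:       " + Arrays.toString(distribute(3000, true)));
    }
}
```

With keys that all start with the same byte, the ordered placement puts every key on a single node, which is the kind of imbalance the Owns column hints at.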
2. Primary Keys and Row Keys
Aggregation Table
A coarse graining of a time series into measures on buckets of larger size than original time resolution
[Chart: the original series plotted over T1 to T24 (y-axis 0 to 10), above its coarse graining into 8-hour buckets (x-axis labels T1, T8, T9, T16, T24) showing the 8-Hour Mean, 8-Hour Max, and 8-Hour Min]
2. Primary Keys and Row Keys
Original Persistence Model
Aggregation_ID | Index | Period | Count | Sum | Average | Max | Min | Measurements
UUID | Long | DateTime | Long | Double | Double | Double | Double | Map<DateTime, Double>
²  Aggregation_ID: UUID identifier associated with aggregation metadata
²  Period: DateTime of start of aggregation
²  Index: Offset from DateTime of fixed aggregation bucket
²  Count, Sum, Average, Max, Min: values of aggregation on the bucket
²  Measurements: map of all measurements included in the system
PRIMARY KEY (Aggregation_ID, Index)
2. Primary Keys and Row Keys
Original Persistence Model, Storage Representation
Aggregation_ID → Index 1: {Period, Count, etc.} | Index 2: {Period, Count, etc.} | Index 3: {Period, Count, etc.} | … | Index N: {Period, Count, etc.}
²  SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287  ✔
²  SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287 AND Index = 1  ✔
²  SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287 AND Index >= 1 AND Index < 3  ✔
²  SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287 AND Index = 1 AND Period > '2015-01-01' AND Period < '2015-02-01'  ✖
2. Primary Keys and Row Keys
Fixed Persistence Model
Aggregation_ID | StartDate | Count | Sum | Average | Max | Min | Measurements
UUID | Timestamp | Long | Double | Double | Double | Double | Map<Timestamp, Double>
PRIMARY KEY (Aggregation_ID, StartDate)
²  Index column not required
²  Primary key allows row key and clustering
Aggregation_ID → 2015-01-01: {Count, etc.} | 2015-01-02: {Count, etc.} | 2015-01-03: {Count, etc.} | … | 2015-01-31: {Count, etc.} | … | 2015-12-31: {Count, etc.}
²  SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287 AND StartDate > '2015-01-01' AND StartDate < '2015-02-01'  ✔
2. Primary Keys and Row Keys
Moral: Consider which queries you need to make and design around them
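One way to picture why the clustering key matters: within a partition, rows are kept sorted by that key, so a range over it is a contiguous slice. The sketch below models a single partition as a sorted map in plain Java. The class, the daily rows, and the count value of 1440 are hypothetical, and this is an analogy rather than how Cassandra stores data internally.

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model: one partition is a map sorted by its clustering key. */
class ClusteringRangeSketch {
    /** Builds a partition holding one hypothetical daily count per day of 2015. */
    static NavigableMap<LocalDate, Long> buildPartition() {
        NavigableMap<LocalDate, Long> partition = new TreeMap<>();
        LocalDate day = LocalDate.of(2015, 1, 1);
        while (day.getYear() == 2015) {
            partition.put(day, 1440L); // hypothetical count: one measurement per minute
            day = day.plusDays(1);
        }
        return partition;
    }

    /** Range slice with exclusive bounds, like StartDate > from AND StartDate < to. */
    static Map<LocalDate, Long> slice(NavigableMap<LocalDate, Long> partition,
                                      LocalDate from, LocalDate to) {
        return partition.subMap(from, false, to, false);
    }

    public static void main(String[] args) {
        Map<LocalDate, Long> january = slice(buildPartition(),
                LocalDate.of(2015, 1, 1), LocalDate.of(2015, 2, 1));
        System.out.println(january.size() + " rows in the slice");
    }
}
```

A range on a column that is not the sort key has no such contiguous slice to read, which is roughly why the query on Period was rejected in the original model.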
3. Performance/Missing data in Collection types
Collections in C*
Aggregation_ID | StartDate | Count | Sum | Average | Max | Min | Measurements
UUID | Timestamp | Long | Double | Double | Double | Double | Map<Timestamp, Double>
Rationale
²  C*: supposed to denormalize data
²  Measurements arriving to be included in aggregation
²  How to be sure they’re included?
²  Keep copy
Downsides
²  Lots of storage space – do we really need value?
²  In < 2.1, performance implications (serialization)
²  All values returned
²  64K limit! modulus => missing data
3. Performance/Missing data in Collection types
Collections in C*
Aggregation_ID | StartDate | Count | Sum | Average | Max | Min | Measurements
UUID | Timestamp | Long | Double | Double | Double | Double | Blob
Solution
²  Know start date
²  Know all measurement timestamps in processed data
²  Keep a bit for each
Bitwise Verifier
One minute expected timestamps, 6 minute aggregations. 2 still missing below:
2015-01-01T00:00 | 2015-01-01T00:01 | 2015-01-01T00:02 | 2015-01-01T00:03 | 2015-01-01T00:04 | 2015-01-01T00:05 | 2015-01-01T00:06
1 | 0 | 1 | 1 | 0 | 1 | 1
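The bit-per-timestamp idea can be sketched with java.util.BitSet. This is a minimal illustration, not Timeli's implementation: the class and method names are hypothetical, and the bucket start of 0 stands in for the real start date. The arrivals reproduce the bit pattern on the slide.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Minimal sketch of a bitwise "did this measurement arrive?" verifier. */
class BitwiseVerifierSketch {
    /** Sets one bit per expected slot (start + i * resolution) that has an arrival. */
    static BitSet received(long startMillis, long resolutionMillis, int slots, long[] arrivals) {
        BitSet bits = new BitSet(slots);
        for (long t : arrivals) {
            long slot = Math.floorDiv(t - startMillis, resolutionMillis);
            if (slot >= 0 && slot < slots) {
                bits.set((int) slot);
            }
        }
        return bits;
    }

    /** Lists the slot offsets that are still unset. */
    static List<Integer> missing(BitSet bits, int slots) {
        List<Integer> out = new ArrayList<>();
        for (int i = bits.nextClearBit(0); i < slots; i = bits.nextClearBit(i + 1)) {
            out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long start = 0L; // hypothetical bucket start
        long[] arrivals = {0, 2 * minute, 3 * minute, 5 * minute, 6 * minute};
        BitSet bits = received(start, minute, 7, arrivals);
        System.out.println("missing slots: " + missing(bits, 7));
        byte[] blob = bits.toByteArray(); // compact form that could go into a blob column
        System.out.println("stored in " + blob.length + " byte(s)");
    }
}
```

Seven expected timestamps fit in a single byte here, compared with a map entry holding a timestamp and a double for each measurement.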
3. Performance/Missing data in Collection types
Moral: limits in Cassandra are important, not always enforced, and have consequences
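The "modulus" remark about the 64K limit can be read as plain arithmetic. The sketch below assumes the element count of a collection travels in an unsigned 16-bit field, which is an interpretation of the slide and not something it states; the class and method names are hypothetical.

```java
/** Arithmetic sketch of a collection size carried in an unsigned 16-bit field (an assumption). */
class CollectionSizeWrapSketch {
    /** Returns the element count that survives a 16-bit size field. */
    static int visibleSize(int actualSize) {
        return actualSize & 0xFFFF; // same as actualSize mod 65536 for non-negative sizes
    }

    public static void main(String[] args) {
        int written = 70_000; // hypothetical number of map entries written
        System.out.println(written + " entries written, " + visibleSize(written) + " visible");
    }
}
```

Under that assumption nothing fails loudly: entries beyond the wrap point are simply not counted, which matches the slide's point that the limit is not enforced.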
4. Batching for “Performance”
Traditional Master/Slave model
[Diagram: Application Server writing data to a Master with two Slaves in a single data center; latencies shown are ~200ms, ~1-10ms, and ~1-10ms]
²  App server writes to remote DB
²  Across network
²  Latency! Many writes => N x 200ms
²  Solution: batch multiple commands to save
4. Batching for “Performance”
Peers model with atomicity
[Diagram: Application Server writing data to Peer A, alongside Peers B and C, in a single data center; latencies shown are ~200ms, ~1-10ms, and ~1-10ms]
²  Batches are atomic
²  CAP: can either lock DB across all nodes or perform on just one and publish
²  Cassandra chooses latter (fast writes)
²  => Batches with large numbers of writes all execute on A
²  => 1/3 the processing power
4. Batching for “Performance”
Moral: don’t batch for speed
5. Double Precision vs. BigDecimal
Java has some quirks with floating point representations
What do the following have in common?
²  double a = Math.round(1.14 * 75); // 85.5 is represented as 85.4999…, so this rounds to 85
²  double b = 10.0 / 3; // = 3.3333333333333335
²  for (float f = 10f; f != 0; f -= 0.1) { System.out.println(f); }
²  double x = .37; // .370000004 or .36999999998 or …
5. Double Precision vs. BigDecimal
The model so far
Aggregation_ID | StartDate | Count | Sum | Average | Max | Min | Measurements
UUID | Timestamp | Long | Double | Double | Double | Double | Blob
²  Cassandra written in Java
²  Java has floating point errors
²  Our aggregated values are leaking!
Aggregation_ID | StartDate | Count | Measures | Measurements
UUID | Timestamp | Long | Map<String, BigDecimal> | Blob
For good measure …
²  Wrapped our measures in a Map for flexibility (add new measures on fly)
5. Double Precision vs. BigDecimal
Moral: Law of Leaky Abstractions (a Java app is a Java app)
Bonus moral: use C* collections for good, not evil
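The rounding example from this section can be run side by side with its BigDecimal equivalent. This is a small sketch with a hypothetical class name; it only contrasts binary double arithmetic with decimal arithmetic on the same inputs.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

/** Contrasts double and BigDecimal on the 1.14 * 75 rounding example. */
class DecimalVsDoubleSketch {
    /** Binary floating point: the product lands just below 85.5. */
    static long roundWithDouble() {
        return Math.round(1.14 * 75);
    }

    /** Decimal arithmetic: the product is exactly 85.50, so half-up rounding goes up. */
    static BigDecimal roundWithBigDecimal() {
        return new BigDecimal("1.14")
                .multiply(new BigDecimal("75"))
                .setScale(0, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        System.out.println("double:     " + roundWithDouble());
        System.out.println("BigDecimal: " + roundWithBigDecimal());
    }
}
```

Note that the BigDecimal values are built from strings; constructing them from a double literal would carry the binary representation error along.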
6. QueryBuilder vs Prepared Statements
CQL Driver in Java allows various types of statements
1.  Regular Statement
2.  Prepared Statement
Regular Statement:
²  Convenient
²  Readable
²  QueryBuilder to help build
²  Tempting!
6. QueryBuilder vs Prepared Statements
Query Schematic (Regular Statement)
QueryBuilder.select().all().from("table").where(QueryBuilder.eq("partition_key", 5))
[Diagram: the App Server sends the full statement to the Cassandra Cluster, which returns a ResultSet]
6. QueryBuilder vs Prepared Statements
Problem: Regular Statements are a lot of bytes!
Bound Statements
²  Register with C* cluster
²  Text of statement sent once with placeholders
²  Subsequent requests are a key and params
²  Avoids transfer costs
6. QueryBuilder vs Prepared Statements
Query Schematic (Bound Statement)
[Diagram: the App Server registers “select * from table where partition_key = ?” with the Cassandra Cluster, then sends only the bound value 5; the cluster returns a ResultSet]
6. QueryBuilder vs Prepared Statements
Moral: Caching is your friend. Cache queries on C*, particularly ones being done many times.
7. Row Limits
The model so far: “Wide Rows”
Aggregation_ID | StartDate | Count | Measures | Measurements
UUID | Timestamp | Long | Map<String, BigDecimal> | Blob
²  Unique ID for partition
²  StartDate clustering key allows ranged queries
²  Count of measurements included
²  Map of measures with precise storage
²  Binary representation of measurements included
7. Row Limits
Practical storage limits
²  Cassandra row limit => 2 billion items per row
²  Best results (eBay) “a few hundred million per row” (~500 mil)
How much time does this represent?
Time Resolution | 500 million timestamps
1 day | ~1.37E6 years
1 hour | 57,077 years
1 minute | 951 years
1 second | 15.85 years
1 millisecond | 5.79 days
7. Row Limits
No business case has yet used aggregations on less than 1 min
For aggregations we’re probably fine
But we collect raw/processed measurements as well
At millisecond resolution, <6 days not ok
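The time spans quoted for 500 million timestamps follow from simple arithmetic. The sketch below recomputes them, assuming 365-day years; the class and method names are hypothetical.

```java
/** Recomputes how much wall-clock time a fixed number of timestamps covers. */
class RowSpanSketch {
    static final double MILLIS_PER_DAY = 86_400_000d;

    /** Days covered by `items` timestamps spaced `resolutionMillis` apart. */
    static double spanDays(long items, long resolutionMillis) {
        return items * (double) resolutionMillis / MILLIS_PER_DAY;
    }

    /** Same span in years, assuming 365-day years. */
    static double spanYears(long items, long resolutionMillis) {
        return spanDays(items, resolutionMillis) / 365.0;
    }

    public static void main(String[] args) {
        long items = 500_000_000L;
        System.out.printf("1 ms resolution: %.2f days%n", spanDays(items, 1L));
        System.out.printf("1 s resolution:  %.2f years%n", spanYears(items, 1_000L));
        System.out.printf("1 min resolution: %.0f years%n", spanYears(items, 60_000L));
    }
}
```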
Can constrain row size using compound PK
²  Have resolution on channel, Rc (milliseconds)
²  Have number of items in row K (e.g. 500m)
²  Get a baseline on epoch (Jan 1, 1970 12:00AM)
²  => The Batch index can be calculated
int batchInd = (int) Math.floorDiv(date.getMillis(), K * Rc); // K and Rc as long; K * Rc = milliseconds covered by one row
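The batch index amounts to integer division of the epoch time by the time span one row is allowed to cover. The sketch below is a standalone version under that reading: the class name and parameter names are hypothetical, and it takes epoch milliseconds directly instead of a date object.

```java
/** Computes which row (batch) a timestamp belongs to, given a cap on items per row. */
class BatchIndexSketch {
    /**
     * @param epochMillis      timestamp in milliseconds since Jan 1, 1970
     * @param itemsPerRow      K, the maximum number of items wanted in one row
     * @param resolutionMillis Rc, the channel resolution in milliseconds
     */
    static long batchIndex(long epochMillis, long itemsPerRow, long resolutionMillis) {
        // Time covered by one full row; multiplyExact fails loudly on overflow.
        long rowSpanMillis = Math.multiplyExact(itemsPerRow, resolutionMillis);
        // floorDiv keeps pre-epoch timestamps in their own (negative) batches.
        return Math.floorDiv(epochMillis, rowSpanMillis);
    }

    public static void main(String[] args) {
        long k = 500_000_000L; // K from the slide
        long rc = 1L;          // millisecond resolution
        System.out.println(batchIndex(1_000_000_000_000L, k, rc));
    }
}
```

Dividing by the product K * Rc, and not dividing by K and then multiplying by Rc, is what keeps every row at or under K items.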
7. Row Limits
Leads to model
Aggregation_ID | BatchIndex | StartDate | Count | Measures | Measurements
UUID | Int | Timestamp | Long | Map<String, BigDecimal> | Blob
CQL:
CREATE TABLE aggregations (
    Aggregation_ID varchar,
    BatchIndex int,
    StartDate timestamp,
    Count bigint,
    Measures map<text, blob>,
    Measurements blob,
    PRIMARY KEY ((Aggregation_ID, BatchIndex), StartDate)
);
7. Row Limits
Moral: Haven’t we had enough morals for one story?
Wrapup
Evolution
Initial model
Aggregation_ID | Index | Period | Count | Sum | Average | Max | Min | Measurements
UUID | Long | Timestamp | Long | Double | Double | Double | Double | Map<Timestamp, Double>
1.  Couldn’t do ranged queries in time
2.  Ran out of space in measurement map
3.  Columnar approach to measures => less flexibility
4.  Rows not very wide
Final model
Aggregation_ID | BatchIndex | StartDate | Count | Measures | Measurements
UUID | Int | Timestamp | Long | Map<String, BigDecimal> | Blob
Wrapup
Lessons Learned
1.  Read the manual. Partitioners are important. Other configuration options as well.
2.  Consider which queries you need to make and design around them.
3.  Limits in Cassandra are important, not always enforced, and have consequences. Exceeding collection limits will lose you data.
4.  Don’t batch for speed, only for atomicity.
5.  C* is a Java app and subject to floating point errors.
6.  C* collections are useful for avoiding multitable queries without joins.
7.  Cache queries on C* using Prepared/Bound statements, particularly ones being done many times.
8.  Pay attention to row limits
Wrapup
By understanding Cassandra (& how she differs from SQL), we avoid our servers (our business) meeting this fate.
Sorry Brad.

 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 
Apache Cassandra and Drivers
Apache Cassandra and DriversApache Cassandra and Drivers
Apache Cassandra and Drivers
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Último (20)

04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

Timeli: Believing Cassandra: Our Big-Data Journey To Enlightenment under the Cassandra Paradigm

  • 1. Believing Cassandra: Timeli.io’s Big-Data journey to enlightenment under the C* Paradigm
      Keith Nordstrom, PhD, CTO and Co-Founder, Timeli.io
  • 2. Company Overview
      Company
        - Founded in 2013
        - Based in Boulder, CO & Sunnyvale, CA
      Product/Business
        - Predictive asset analytics solutions
        - Operational applications for connected equipment
      Technology Platform
        - Time series data and analytics platform
        - Proprietary time series data processing layer
        - Leverages “best of breed” open source software
      Industry Verticals
        - Oil & Gas
        - Manufacturing
        - Utilities: Electric, Gas & Water
  • 3. Who are we to talk? (Timeli.io)
        - Time series data ingestion engine, platform, predictive analytics
        - Validation, Estimation, Regularization
        - Aggregations (i.e. coarse graining)
        - Based on utilities software started in Europe in 2009
        - Added Cassandra to the stack in 2011
      I started in late 2013 and quickly discovered something they had missed: Cassandra can be hard to do right.
  • 5. But first … a minor cultural digression
      Cassandra:
        - Sister to Helen of Troy
        - More beautiful, more sought after, wiser
        - Even the gods themselves
        - Promised a wild night to Apollo for the power of prophecy
        - Reneged
        - Apollo left her with prophecy, but made it so nobody believed her
      Moral: Cassandra accurately predicted the Fall of Troy.
  • 6. Just like Cassandra of legend … real-life Cassandra is difficult to “believe”
        - Selects designed beforehand
        - Denormalization
        - Many arcane configuration options
        - Hard to find expertise
        - Based on “tables” but not tabular
        - CQL looks like SQL. It’s not SQL.
      Typical error messages:
        - “No indexed columns present in by-columns clause with Equal operator”
        - “ORDER BY is only supported when the partition key is restricted by an EQ or an IN”
        - “PRIMARY KEY column ‘timestamp’ cannot be restricted”
        - “Cannot execute this query as it might involve data filtering and thus may have unpredictable performance.”
  • 7. What did this mean for Timeli?
      Example: Timeli ingests data, writes it to raw, writes it to processed, then coarse-grains one or more series into “aggregations.”
  • 8. Early Warning: Cassandra isn’t crazy
      Multiple very competent RDBMS/Java/JPA architects built a time series app where the following could not be done against aggregations, the primary product:
        SELECT * FROM aggregations WHERE meter_id = 4bbedd76-4e9e-11e5-885d-feff819cdc9f AND timestamp > '2013-01-01' AND timestamp < '2013-03-01';
      “It’s a security feature! You have to know when your data exists to get your data!”
  • 9. New Beginnings
      Out of all of this, Timeli was born. What did we change?
        1. Partitioner
        2. Primary Keys and Row Keys
        3. Performance/Missing data in Collection types
        4. Batching for “Performance”
        5. Double Precision vs. BigDecimal
        6. QueryBuilder vs Prepared Statements
        7. Row Limits
  • 10. 1. The Partitioner
      What is a partitioner in Cassandra? Three types:
        - Byte Ordered Partitioner
        - Random Partitioner
        - Murmur3 Partitioner
      [Figure: data items (B…, S…, S…, S…, Z…, T…) about to be placed on the Cassandra ring]
  • 11. 1. The Partitioner
      What is a partitioner in Cassandra? Three types:
        - Byte Ordered Partitioner
        - Random Partitioner
        - Murmur3 Partitioner
      [Figure: the same items on the Cassandra ring in the order S…, S…, S…, T…, B…, Z…]
  • 12. 1. The Partitioner
      What is a partitioner in Cassandra? Three types:
        - Byte Ordered Partitioner
        - Random Partitioner
        - Murmur3 Partitioner
      Murmur3 is a random partitioner as well, but faster.
      [Figure: the same items on the Cassandra ring in the order S…, T…, B…, S…, Z…, S…]
  • 13. 1. The Partitioner
        - Our partition keys were of the form {UUID}|{string key}
        - UUID 1s are uniformly distributed but keys are not
        - ByteOrderedPartitioner left big gaps:
      > nodetool status ts
      Datacenter: us-central1
      =======================
      Status=Up/Down
      |/ State=Normal/Leaving/Joining/Moving
      --  Address       Load     Tokens  Owns (effective)  Host ID                               Rack
      UN  10.82.79.110  4.27 GB  256     44.9%             29d0a723-fc1f-4f73-a864-97dc6df045f5  b
      UN  10.105.185.1  2.51 GB  256     26.4%             1d236bd9-5fb1-4423-83bc-168bac924db4  b
      UN  10.234.92.2   2.73 GB  256     28.7%             29e1358a-bef2-495e-80bc-3de4c4499790  b
  • 14. 1. The Partitioner
      Moral: Read the manual. Odds are you won’t think of the consequences on your own.
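The skew in the `nodetool status` output can be illustrated with a toy model. The sketch below is not Cassandra's partitioner code: it assumes three nodes, hypothetical keys that share a common prefix (`sensor-0000` …), an "ordered" placement that splits the byte range 0..255 evenly by the key's first byte, and a "hashed" placement that takes an MD5 hash modulo the node count. All class and method names are made up for the illustration.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Toy comparison of ordered vs hashed key placement (not Cassandra's real code).
final class PartitionerSketch {

    // Ordered placement: the node is chosen from the key's first byte.
    static int orderedNode(String key, int nodes) {
        int firstByte = key.getBytes(StandardCharsets.UTF_8)[0] & 0xFF;
        return firstByte * nodes / 256;
    }

    // Hashed placement: the node is chosen from an MD5 hash of the whole key.
    static int hashedNode(String key, int nodes) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(1, digest).mod(BigInteger.valueOf(nodes)).intValue();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    static int[] countOrdered(String[] keys, int nodes) {
        int[] counts = new int[nodes];
        for (String key : keys) counts[orderedNode(key, nodes)]++;
        return counts;
    }

    static int[] countHashed(String[] keys, int nodes) {
        int[] counts = new int[nodes];
        for (String key : keys) counts[hashedNode(key, nodes)]++;
        return counts;
    }

    // Hypothetical keys with a shared prefix, which is what hurts ordered placement.
    static String[] sampleKeys(int n) {
        String[] keys = new String[n];
        for (int i = 0; i < n; i++) keys[i] = String.format("sensor-%04d", i);
        return keys;
    }

    public static void main(String[] args) {
        String[] keys = sampleKeys(1000);
        System.out.println("ordered: " + Arrays.toString(countOrdered(keys, 3)));
        System.out.println("hashed:  " + Arrays.toString(countHashed(keys, 3)));
    }
}
```

With these sample keys the ordered placement puts every key on one node, because they all start with the same byte, while the hashed placement spreads them across all three.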
  • 15. 2. Primary Keys and Row Keys
      Aggregation Table: a coarse graining of a time series into measures on buckets of larger size than the original time resolution.
      [Figure: original series over T1–T24, plotted against its 8-hour mean, 8-hour max and 8-hour min]
  • 16. 2. Primary Keys and Row Keys
      Original Persistence Model
        Column          Type
        Aggregation_ID  UUID
        Index           Long
        Period          DateTime
        Count           Long
        Sum             Double
        Average         Double
        Max             Double
        Min             Double
        Measurements    Map<DateTime, Double>
      PRIMARY KEY (Aggregation_ID, Index)
        - Aggregation_ID: UUID/identifier associated with aggregation metadata
        - Period: DateTime of start of aggregation
        - Index: offset from DateTime of fixed aggregation bucket
        - Count, Sum, Average, Max, Min: values of aggregation on the bucket
        - Measurements: map of all measurements included in the system
  • 17. 2. Primary Keys and Row Keys
      Original Persistence Model, Storage Representation
      [Figure: one partition per Aggregation_ID, with cells Index 1, Index 2, Index 3, …, Index N, each holding Period, Count, etc.]
        ✔ SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287
  • 18. 2. Primary Keys and Row Keys
      Original Persistence Model, Storage Representation
      [Figure: one partition per Aggregation_ID, with cells Index 1, Index 2, Index 3, …, Index N, each holding Period, Count, etc.]
        ✔ SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287
        ✔ SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287 AND Index = 1
  • 19. 2. Primary Keys and Row Keys
      Original Persistence Model, Storage Representation
      [Figure: one partition per Aggregation_ID, with cells Index 1, Index 2, Index 3, …, Index N, each holding Period, Count, etc.]
        ✔ SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287
        ✔ SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287 AND Index = 1
        ✔ SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287 AND Index >= 1 AND Index < 3
  • 20. 2. Primary Keys and Row Keys
      Original Persistence Model, Storage Representation
      [Figure: one partition per Aggregation_ID, with cells Index 1, Index 2, Index 3, …, Index N, each holding Period, Count, etc.]
        ✔ SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287
        ✔ SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287 AND Index = 1
        ✔ SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287 AND Index >= 1 AND Index < 3
        ✖ SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287 AND Index = 1 AND Period > '2015-01-01' AND Period < '2015-02-01'
      The last query fails because Period is not part of the primary key.
  • 21. 2. Primary Keys and Row Keys
      Fixed Persistence Model
        Column          Type
        Aggregation_ID  UUID
        StartDate       Timestamp
        Count           Long
        Sum             Double
        Average         Double
        Max             Double
        Min             Double
        Measurements    Map<Timestamp, Double>
      PRIMARY KEY (Aggregation_ID, StartDate)
        - Index column not required
        - Primary key allows row key and clustering
      [Figure: one partition per Aggregation_ID, with cells 2015-01-01, 2015-01-02, 2015-01-03, …, 2015-01-31, …, 2015-12-31, each holding Count, etc.]
        ✔ SELECT * FROM aggregations WHERE Aggregation_ID = bdb8330e-6f02-457f-8eb7-553b4db86287 AND StartDate > '2015-01-01' AND StartDate < '2015-02-01'
  • 22. 2. Primary Keys and Row Keys
      Moral: Consider which queries you need to make and design around them.
  • 23. 3. Performance/Missing data in Collection types
      Collections in C*
      Rationale
        - C*: you are supposed to denormalize data
        - Measurements keep arriving to be included in an aggregation
        - How to be sure they’re included?
        - Keep a copy
        Column          Type
        Aggregation_ID  UUID
        StartDate       Timestamp
        Count           Long
        Sum             Double
        Average         Double
        Max             Double
        Min             Double
        Measurements    Map<Timestamp, Double>
  • 24. 3. Performance/Missing data in Collection types
      Collections in C*
      Rationale
        - C*: you are supposed to denormalize data
        - Measurements keep arriving to be included in an aggregation
        - How to be sure they’re included?
        - Keep a copy
      Downsides
        - Lots of storage space: do we really need the values?
        - In < 2.1, performance implications (serialization)
        - All values are returned
        - 64K limit on collection items! Beyond it the count wraps (modulus) => missing data
  • 25. 3. Performance/Missing data in Collection types
      Collections in C*
        Column          Type
        Aggregation_ID  UUID
        StartDate       Timestamp
        Count           Long
        Sum             Double
        Average         Double
        Max             Double
        Min             Double
        Measurements    Blob
      Solution: Bitwise Verifier
        - Know the start date
        - Know all measurement timestamps in the processed data
        - Keep a bit for each
      One-minute expected timestamps, 6-minute aggregations. 2 are still missing below:
        Expected timestamp  Bit
        2015-01-01T00:00    1
        2015-01-01T00:01    0
        2015-01-01T00:02    1
        2015-01-01T00:03    1
        2015-01-01T00:04    0
        2015-01-01T00:05    1
        2015-01-01T00:06    1
  • 26. 3. Performance/Missing data in Collection types
      Moral: limits in Cassandra are important, not always enforced, and have consequences.
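The bitwise verifier from slide 25 can be sketched with `java.util.BitSet`. This is a minimal illustration, not Timeli's implementation: the class name `BitwiseVerifier` and its methods are hypothetical, and it assumes expected timestamps fall on a fixed resolution starting at the aggregation's start date.

```java
import java.time.Instant;
import java.util.BitSet;

// One bit per expected timestamp; a set bit means the measurement has arrived.
final class BitwiseVerifier {
    private final long startMillis;
    private final long resolutionMillis;
    private final int expected;
    private final BitSet seen;

    BitwiseVerifier(long startMillis, long resolutionMillis, int expected) {
        this.startMillis = startMillis;
        this.resolutionMillis = resolutionMillis;
        this.expected = expected;
        this.seen = new BitSet(expected);
    }

    // Record that the measurement with this timestamp is included in the aggregation.
    void mark(long timestampMillis) {
        long offset = timestampMillis - startMillis;
        if (offset < 0 || offset % resolutionMillis != 0) {
            throw new IllegalArgumentException("timestamp is not on the expected grid");
        }
        long slot = offset / resolutionMillis;
        if (slot >= expected) {
            throw new IllegalArgumentException("timestamp is outside the bucket");
        }
        seen.set((int) slot);
    }

    // Number of expected measurements that have not arrived yet.
    int missing() {
        return expected - seen.cardinality();
    }

    // Compact form that could be stored in the blob column.
    byte[] toBlob() {
        return seen.toByteArray();
    }

    public static void main(String[] args) {
        long start = Instant.parse("2015-01-01T00:00:00Z").toEpochMilli();
        BitwiseVerifier verifier = new BitwiseVerifier(start, 60_000L, 7);
        for (int minute : new int[] {0, 2, 3, 5, 6}) {
            verifier.mark(start + minute * 60_000L);
        }
        System.out.println("missing: " + verifier.missing());
    }
}
```

The seven expected timestamps from the slide fit in a single byte, whereas the original map stored a timestamp and a double for every measurement.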
  • 27. 4. Batching for “Performance”
      Traditional Master/Slave model
        - App server writes to a remote DB
        - Across the network
        - Latency! Many writes => N x 200ms
        - Solution: batch multiple commands to save round trips
      [Figure: application server writing data to a master with two slaves in a single data center; link latencies labeled ~200ms, ~1-10ms, ~1-10ms]
  • 28. 4. Batching for “Performance”
      Peers model with atomicity
        - Batches are atomic
        - CAP: can either lock the DB across all nodes, or perform the batch on just one node and publish
        - Cassandra chooses the latter (fast writes)
        - => Batches with large numbers of writes all execute on A
        - => 1/3 the processing power
      [Figure: application server writing data to peer A, with peers B and C, in a single data center; link latencies labeled ~200ms, ~1-10ms, ~1-10ms]
  • 29. 4. Batching for “Performance”
      Moral: don’t batch for speed.
  • 30. 5. Double Precision vs. BigDecimal
      Java has some quirks with floating point representations. What do the following have in common?
        - double a = Math.round(1.14 * 75); // 85.5 is represented as 85.4999..., so this rounds to 85
        - double b = 10.0 / 3; // = 3.3333333333333335
        - for (float f = 10f; f != 0; f -= 0.1) { System.out.println(f); } // f may never be exactly 0
        - double x = .37; // .370000004 or .36999999998 or …
  • 31. 5. Double Precision vs. BigDecimal
      The model so far
        Column          Type
        Aggregation_ID  UUID
        StartDate       Timestamp
        Count           Long
        Sum             Double
        Average         Double
        Max             Double
        Min             Double
        Measurements    Blob
        - Cassandra is written in Java
        - Java doubles carry floating point errors
        - Our aggregated values are leaking!
      The revised model
        Column          Type
        Aggregation_ID  UUID
        StartDate       Timestamp
        Count           Long
        Measures        Map<String, BigDecimal>
        Measurements    Blob
      For good measure …
        - Wrapped our measures in a Map for flexibility (add new measures on the fly)
  • 32. 5. Double Precision vs. BigDecimal
      Moral: Law of Leaky Abstractions (a Java app is a Java app).
      Bonus moral: use C* collections for good, not evil.
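The rounding example from slide 30 can be checked directly against `BigDecimal`. The sketch below uses only the JDK; the class and method names are made up for the illustration.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Compares binary floating point with decimal arithmetic on the slide's example.
final class DecimalVsDouble {

    // 1.14 has no exact binary representation, so the product lands just below 85.5.
    static long roundWithDouble() {
        return Math.round(1.14 * 75);
    }

    // BigDecimal built from a string keeps 1.14 exact, so the product is exactly 85.50.
    static long roundWithBigDecimal() {
        return new BigDecimal("1.14")
                .multiply(new BigDecimal("75"))
                .setScale(0, RoundingMode.HALF_UP)
                .longValueExact();
    }

    static boolean doubleSumIsExact() {
        return 0.1 + 0.2 == 0.3;
    }

    static boolean decimalSumIsExact() {
        BigDecimal sum = new BigDecimal("0.1").add(new BigDecimal("0.2"));
        return sum.compareTo(new BigDecimal("0.3")) == 0;
    }

    public static void main(String[] args) {
        System.out.println("double:     " + roundWithDouble());
        System.out.println("BigDecimal: " + roundWithBigDecimal());
        System.out.println("0.1 + 0.2 == 0.3 with double?     " + doubleSumIsExact());
        System.out.println("0.1 + 0.2 == 0.3 with BigDecimal? " + decimalSumIsExact());
    }
}
```

Note that the `BigDecimal` values are constructed from strings. `new BigDecimal(1.14)` would capture the double's binary approximation and carry the same error forward.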
  • 33. 6. QueryBuilder vs Prepared Statements
      The CQL driver in Java allows various types of statements:
        1. Regular Statement
        2. Prepared Statement
      Regular Statement:
        - Convenient
        - Readable
        - QueryBuilder to help build
        - Tempting!
  • 34. 6. QueryBuilder vs Prepared Statements
      Query Schematic (Regular Statement)
        QueryBuilder.select().all().from("table").where(QueryBuilder.eq("partition_key", 5))
      [Figure: the app server sends the full statement to the Cassandra cluster and receives a ResultSet]
  • 35. 6. QueryBuilder vs Prepared Statements
      Problem: Regular Statements are a lot of bytes!
      Bound Statements
        - Register with the C* cluster
        - Text of the statement is sent once, with placeholders
        - Subsequent requests are a key and params
        - Avoids transfer costs
  • 36. 6. QueryBuilder vs Prepared Statements
      Query Schematic (Bound Statement)
        “select * from table where partition_key = ?”, bound with the value 5
      [Figure: the app server sends the prepared statement and its parameter to the Cassandra cluster and receives a ResultSet]
  • 37. 6. QueryBuilder vs Prepared Statements
      Moral: Caching is your friend. Cache queries on C*, particularly ones being done many times.
  • 38. 7. Row Limits
      The model so far: “Wide Rows”
        - Unique ID for the partition
        - StartDate clustering key allows ranged queries
        - Count of measurements included
        - Map of measures with precise storage
        - Binary representation of measurements included
        Column          Type
        Aggregation_ID  UUID
        StartDate       Timestamp
        Count           Long
        Measures        Map<String, BigDecimal>
        Measurements    Blob
  • 39. 7. Row Limits
      Practical storage limits
        - Cassandra row limit => 2 billion items per row
        - Best results (eBay): “a few hundred million per row” (~500 million)
      How much time does this represent?
        Time resolution  500 million timestamps
        1 day            ~1.37 E 6 years
        1 hour           57,077 years
        1 minute         951 years
        1 second         15.85 years
        1 millisecond    5.78 days
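The figures in the table on slide 39 follow from multiplying the item count by the time resolution. A small sketch of that arithmetic, assuming a 365-day year (the class and method names are made up for the illustration):

```java
// How much wall-clock time a row covers at a given item count and resolution.
final class RowSpan {
    private static final double MILLIS_PER_DAY = 24.0 * 60 * 60 * 1000;
    private static final double MILLIS_PER_YEAR = 365.0 * MILLIS_PER_DAY;

    static double daysCovered(long items, long resolutionMillis) {
        return items * (double) resolutionMillis / MILLIS_PER_DAY;
    }

    static double yearsCovered(long items, long resolutionMillis) {
        return items * (double) resolutionMillis / MILLIS_PER_YEAR;
    }

    public static void main(String[] args) {
        long items = 500_000_000L;
        System.out.printf("1 minute:      %.2f years%n", yearsCovered(items, 60_000L));
        System.out.printf("1 second:      %.2f years%n", yearsCovered(items, 1_000L));
        System.out.printf("1 millisecond: %.2f days%n", daysCovered(items, 1L));
    }
}
```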
  • 40. 7. Row Limits
      No business case has yet used aggregations at less than 1 minute, so for aggregations we’re probably fine. But we collect raw/processed measurements as well, and at millisecond resolution fewer than 6 days per row is not OK.
      We can constrain row size using a compound partition key:
        - Have the resolution on the channel, Rc (milliseconds)
        - Have the number of items in a row, K (e.g. 500 million)
        - Get a baseline on the epoch (Jan 1, 1970 12:00 AM UTC)
        - => The batch index can be calculated:
          double batchInd = Math.floor(date.getMillis() / (double) (K * Rc));
  • 41. 7. Row Limits
      Leads to the model
        Column          Type
        Aggregation_ID  UUID
        BatchIndex      Int
        StartDate       Timestamp
        Count           Long
        Measures        Map<String, BigDecimal>
        Measurements    Blob
      CQL:
        CREATE TABLE aggregations (
          Aggregation_ID varchar,
          BatchIndex int,
          StartDate timestamp,
          Count bigint,
          Measures map<text, blob>,
          Measurements blob,
          PRIMARY KEY ((Aggregation_ID, BatchIndex), StartDate)
        );
  • 42. 7. Row Limits
      Moral: Haven’t we had enough morals for one story?
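The batch index calculation from slide 40 can be written with integer arithmetic only. This is a sketch under the slide's assumptions: K items per row, a channel resolution Rc in milliseconds, and timestamps measured in milliseconds since the epoch. The class and method names are hypothetical.

```java
// Maps a timestamp to the batch (partition bucket) it belongs to.
final class BatchIndexer {

    // Each batch spans itemsPerRow * resolutionMillis milliseconds, counted from the epoch.
    static long batchIndex(long epochMillis, long itemsPerRow, long resolutionMillis) {
        if (itemsPerRow <= 0 || resolutionMillis <= 0) {
            throw new IllegalArgumentException("itemsPerRow and resolutionMillis must be positive");
        }
        long spanMillis = Math.multiplyExact(itemsPerRow, resolutionMillis);
        // floorDiv rounds toward negative infinity, so pre-1970 timestamps also bucket correctly.
        return Math.floorDiv(epochMillis, spanMillis);
    }

    public static void main(String[] args) {
        long k = 500_000_000L; // items per row
        long rc = 1L;          // millisecond resolution
        System.out.println(batchIndex(0L, k, rc));
        System.out.println(batchIndex(500_000_000L, k, rc));
    }
}
```

Grouping `K * Rc` before dividing matters: dividing by K first and then multiplying by Rc gives a different, incorrect bucket.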
  • 43. Wrapup
      Evolution
      Initial model
        Column          Type
        Aggregation_ID  UUID
        Index           Long
        Period          Timestamp
        Count           Long
        Sum             Double
        Average         Double
        Max             Double
        Min             Double
        Measurements    Map<Timestamp, Double>
        1. Couldn’t do ranged queries in time
        2. Ran out of space in the measurement map
        3. Columnar approach to measures => less flexibility
        4. Rows not very wide
      Final model
        Column          Type
        Aggregation_ID  UUID
        BatchIndex      Int
        StartDate       Timestamp
        Count           Long
        Measures        Map<String, BigDecimal>
        Measurements    Blob
  • 44. Wrapup
      Lessons Learned
        1. Read the manual. Partitioners are important. Other configuration options are as well.
        2. Consider which queries you need to make and design around them.
        3. Limits in Cassandra are important, not always enforced, and have consequences. Exceeding collection limits will lose you data.
        4. Don’t batch for speed, only for atomicity.
        5. C* is a Java app and subject to floating point errors.
        6. C* collections are useful for avoiding multi-table queries without joins.
        7. Cache queries on C* using Prepared/Bound statements, particularly ones being done many times.
        8. Pay attention to row limits.
  • 45. Wrapup
      By understanding Cassandra (& how she differs from SQL), we avoid our servers (our business) meeting this fate. Sorry Brad.