Time series data is everywhere: IoT, sensor data, financial transactions. The industry has moved to databases like Cassandra to handle the high velocity and high volume of data that is now commonplace. In this talk I will present how we have used Cassandra to store time series data. I will highlight both the Cassandra data model and the architecture we put in place for collecting and ingesting data into Cassandra, using Apache Kafka and Apache Storm.
2. Guido Schmutz
• Working for Trivadis for more than 19 years
• Oracle ACE Director for Fusion Middleware and SOA
• Co-author of several books
• Consultant, Trainer and Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Member of the Trivadis Architecture Board
• Technology Manager @ Trivadis
• More than 25 years of software development experience
• Contact: guido.schmutz@trivadis.com
• Blog: http://guidoschmutz.wordpress.com
• Slideshare: http://de.slideshare.net/gschmutz
• Twitter: gschmutz
7. Data Science Lab @ Armasuisse W&T
• W+T flagship project, standing for innovation & tech transfer
• Building capabilities in the areas of:
• Social Media Intelligence (SOCMINT)
• Big Data technologies & architectures
• Investing in new, innovative and not yet widely proven technology:
• Batch / real-time analysis
• NoSQL databases
• Text analysis (NLP)
• Graph data
• …
• 3 phases: June 2013 – June 2015
8. SOCMINT Demonstrator – Time Dimension
• Major data model: time series (TS)
• Time series reflect user behaviour over time
• Activities correlate with events
• Anomaly detection
• Event detection & prediction
9. SOCMINT Demonstrator – Social Dimension
• User-user networks (social graphs); Twitter: follower, retweet and mention graphs
• Who is central in a social network?
• Who has retweeted a given tweet to whom?
10. SOCMINT Demonstrator – “Lambda Architecture” for Big Data
[Architecture diagram] Data sources (social, RDBMS, sensor, ERP, logfiles, mobile, machine) feed a data collection channel (messaging). From there, data flows along two paths:
• (Analytical) batch data processing: raw data (reservoir) → batch compute → batch result store
• (Analytical) real-time data processing: stream/event processing → real-time result store
A query engine over the result stores (computed information) serves the data access layer: reports, services, analytic tools and alerting tools. Legend: data in motion vs. data at rest.
11. SOCMINT Demonstrator – Frameworks & Components in Use
[Same Lambda Architecture diagram as slide 10, annotated with the frameworks and components used in each building block.]
12. Streaming Analytics Processing Pipeline
• Queuing: Kafka provides reliable and efficient queuing
• Processing: Storm processes the events (rollups, counts)
• Storing: Cassandra stores the results at the same speed
[Pipeline diagram] Twitter sensors 1–3 → Kafka (queuing) → Storm (processing) → Cassandra (storing) → visualization applications
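To make the "rollups, counts" step concrete, here is a minimal sketch of the kind of per-bucket aggregation the Storm topology performs before writing to Cassandra. This is illustrative pure Python, not the talk's actual topology code; the `rollup` function name and the tuple format are assumptions.

```python
from collections import Counter
from datetime import datetime

def rollup(events, resolution="hour"):
    """Count events per (sensor, time-bucket) at a given resolution,
    the kind of rollup a Storm bolt would emit towards Cassandra."""
    fmt = {"second": "%Y-%m-%d %H:%M:%S",
           "minute": "%Y-%m-%d %H:%M",
           "hour":   "%Y-%m-%d %H"}[resolution]
    counts = Counter()
    for sensor_id, ts in events:
        counts[(sensor_id, ts.strftime(fmt))] += 1
    return counts

tweets = [
    ("ABC-001", datetime(2015, 10, 14, 10, 50, 0)),
    ("ABC-001", datetime(2015, 10, 14, 10, 59, 59)),
    ("ABC-001", datetime(2015, 10, 14, 11, 0, 0)),
]
print(rollup(tweets, "hour"))
```

In the real pipeline these counts arrive incrementally, which is why the Cassandra side (slide 23) uses counter columns rather than precomputed totals.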
14. Cassandra Data Modelling
• Don’t think relational!
• Denormalize, denormalize, denormalize …
• Rows are gigantic and sorted => one row is stored on one node
• Know your application/use cases => from query to model
• An index is no longer an afterthought => “index” upfront
• Control the physical storage structure
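As an illustration of "from query to model": the same raw tweets are written to several query-specific structures instead of one normalized table. This in-memory sketch mirrors the table names used in the CQL examples later in the deck, but is otherwise illustrative.

```python
# Denormalization sketch: one write fans out into one structure per query.
raw_tweets = [
    {"tweet_id": 10000121, "username": "gschmutz",
     "sensor_id": "ABC-001", "time_id": "2015-10-14 10:50:00"},
    {"tweet_id": 10000127, "username": "DataStax",
     "sensor_id": "ABC-001", "time_id": "2015-10-14 10:51:00"},
]

# Query 1: lookup by tweet_id -> skinny-row "tweet" table keyed by tweet_id
tweet = {t["tweet_id"]: t for t in raw_tweets}

# Query 2: timeline per sensor -> wide-row "tweet_timeline" table keyed by
# (sensor_id, bucket_id), newest first
tweet_timeline = {}
for t in raw_tweets:
    bucket = "SECOND-" + t["time_id"][:10]          # daily bucket
    key = (t["sensor_id"], bucket)
    tweet_timeline.setdefault(key, []).insert(0, t["tweet_id"])

print(tweet[10000121]["username"])
print(tweet_timeline[("ABC-001", "SECOND-2015-10-14")])
```

Each dictionary here corresponds to a Cassandra table optimized for exactly one access path, which is the point of modelling from the query backwards.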
18. Know your application => From query to model
• Show timeline of tweets
• Show time series at different levels of aggregation (resolution):
• Seconds
• Minutes
• Hours
19. Show Timeline: Provide Raw Data (Tweets)

CREATE TABLE tweet (tweet_id bigint,
                    username text,
                    message text,
                    hashtags list<text>,
                    latitude double,
                    longitude double,
                    …
                    PRIMARY KEY (tweet_id));

• Skinny-row table
• Holds the sensor raw data => tweets
• Similar to a relational table
• The primary key is also the partition key (no clustering key)

Row layout:
10000121 → username: gschmutz | message: Getting ready for .. | hashtags: [cassandra, nosql] | latitude: 0 | longitude: 0
20121223 → username: DataStax | message: The Speed Factor .. | hashtags: [BigData] | latitude: 0 | longitude: 0
20. Show Timeline: Provide Raw Data (Tweets)

INSERT INTO tweet (tweet_id, username, message, hashtags, latitude, longitude)
VALUES (10000121, 'gschmutz',
        'Getting ready for my talk about using Cassandra for Timeseries and Graph Data',
        ['cassandra', 'nosql'], 0, 0);

SELECT tweet_id, username, hashtags, message FROM tweet
WHERE tweet_id = 10000121;

tweet_id | username | hashtags               | message
---------+----------+------------------------+----------------------------
10000121 | gschmutz | ['cassandra', 'nosql'] | Getting ready for ...
21. Show Timeline: Provide Sequence of Events

CREATE TABLE tweet_timeline (
  sensor_id text,
  bucket_id text,
  time_id timestamp,
  tweet_id bigint,
  PRIMARY KEY ((sensor_id, bucket_id), time_id))
WITH CLUSTERING ORDER BY (time_id DESC);

• Wide-row table
• Partition key: (sensor_id, bucket_id); clustering key: time_id
• bucket_id creates buckets for columns, e.g. SECOND-2015-10-14

Row layout:
ABC-001:SECOND-2015-10-14 → 10:00:02 → tweet_id 10000121
DEF-931:SECOND-2015-10-14 → 10:09:02 → tweet_id 1003121343 | 09:12:09 → tweet_id 1002111343 | 09:10:02 → tweet_id 1001121343
22. Show Timeline: Provide Sequence of Events

INSERT INTO tweet_timeline (sensor_id, bucket_id, time_id, tweet_id)
VALUES ('ABC-001', 'SECOND-2015-10-14', '2015-10-14 10:50:00', 10000121);

SELECT * FROM tweet_timeline
WHERE sensor_id = 'ABC-001' AND bucket_id = 'SECOND-2015-10-14'
AND time_id <= '2015-10-14 12:00:00';

sensor_id | bucket_id         | time_id                  | tweet_id
----------+-------------------+--------------------------+----------
ABC-001   | SECOND-2015-10-14 | 2015-10-14 11:53:00+0000 | 10020334
ABC-001   | SECOND-2015-10-14 | 2015-10-14 10:52:00+0000 | 10000334
ABC-001   | SECOND-2015-10-14 | 2015-10-14 10:51:00+0000 | 10000127
ABC-001   | SECOND-2015-10-14 | 2015-10-14 10:50:00+0000 | 10000121

Rows are sorted by time_id (descending).
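Deriving the bucket_id used in the partition key can be sketched as follows. The bucket keeps partitions from growing unbounded: per-second data is bucketed by day, while (as slide 23 shows) HOUR and DAY data are bucketed by month. The helper name and format strings are illustrative, not from the talk's production code.

```python
from datetime import datetime

def bucket_id(resolution, ts):
    """Build the bucket_id part of the partition key, e.g.
    SECOND-2015-10-14, HOUR-2015-10, DAY-2015-10."""
    fmt = {
        "SECOND": "%Y-%m-%d",  # second-level data: one partition per day
        "HOUR":   "%Y-%m",     # hour-level data: one partition per month
        "DAY":    "%Y-%m",     # day-level data: one partition per month
    }[resolution]
    return f"{resolution}-{ts.strftime(fmt)}"

ts = datetime(2015, 10, 14, 10, 50)
print(bucket_id("SECOND", ts))  # SECOND-2015-10-14
print(bucket_id("HOUR", ts))    # HOUR-2015-10
```

Both the writer (Storm) and the reader must compute the same bucket_id, since Cassandra can only query within partitions it is told about.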
23. Show Timeseries: Provide List of Metrics

CREATE TABLE tweet_count (
  sensor_id text,
  bucket_id text,
  key text,
  time_id timestamp,
  count counter,
  PRIMARY KEY ((sensor_id, bucket_id), key, time_id))
WITH CLUSTERING ORDER BY (key ASC, time_id DESC);

• Wide-row table
• Partition key: (sensor_id, bucket_id); clustering key: (key, time_id)
• bucket_id creates buckets for columns: SECOND-2015-10-14, HOUR-2015-10, DAY-2015-10
• Partition width for an HOUR bucket: 30 d × 24 h × n keys = n × 720 columns

Row layout:
ABC-001:HOUR-2015-10 → ALL:10:00 → 1’550 | ALL:09:00 → 2’299 | nosql:08:00 → 25
ABC-001:DAY-2015-10 → ALL:14-OCT → 105’999 | ALL:13-OCT → 120’344 | nosql:14-OCT → 2’532
24. Show Timeseries: Provide List of Metrics

UPDATE tweet_count SET count = count + 1
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id = '2015-10-14 10:00:00';

SELECT * FROM tweet_count
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id >= '2015-10-14 08:00:00';

sensor_id | bucket_id    | key | time_id                  | count
----------+--------------+-----+--------------------------+-------
ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 12:00:00+0000 | 100230
ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 11:00:00+0000 | 102230
ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 10:00:00+0000 | 105430
ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 09:00:00+0000 | 203240
ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 08:00:00+0000 | 132230
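The semantics of the counter table above can be sketched in memory: an increment per event, and a range read over one partition ordered by key ASC, time_id DESC. This is an illustrative model of what the CQL does, not the DataStax driver API; function names are assumptions.

```python
from collections import defaultdict

# Model of tweet_count: partition key (sensor_id, bucket_id),
# clustering columns (key, time_id), counter value.
tweet_count = defaultdict(lambda: defaultdict(int))

def increment(sensor_id, bucket_id, key, time_id):
    # Mirrors: UPDATE tweet_count SET count = count + 1 WHERE ...
    tweet_count[(sensor_id, bucket_id)][(key, time_id)] += 1

def select(sensor_id, bucket_id, key, time_from):
    # Mirrors the range SELECT: rows of one partition, newest first
    part = tweet_count[(sensor_id, bucket_id)]
    rows = [(k, t, c) for (k, t), c in part.items()
            if k == key and t >= time_from]
    return sorted(rows, key=lambda r: r[1], reverse=True)

increment("ABC-001", "HOUR-2015-10", "ALL", "2015-10-14 10:00:00")
increment("ABC-001", "HOUR-2015-10", "ALL", "2015-10-14 10:00:00")
increment("ABC-001", "HOUR-2015-10", "ALL", "2015-10-14 09:00:00")
print(select("ABC-001", "HOUR-2015-10", "ALL", "2015-10-14 08:00:00"))
```

In Cassandra the increment is a single idempotency-free counter write, so the pipeline must avoid replaying the same event twice; the in-memory dict hides that concern.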
26. Introduction to the Graph Model – Property Graph

Vertex (node)
• Represents an entity
• Always has an ID
• Can contain properties (key-value pairs)

Edge (relationship)
• Line between nodes
• May be directed or undirected
• Has an ID and properties

Properties
• Values about a node or relationship
• Add semantics to relationships

[Example graph] Vertices: User 1 (id: 16134540, name: cloudera, location: Palo Alto), User 2 (id: 18898576, name: gschmutz, location: Berne), Tweet 1 (id: 18898576, text: Join BigData.., time: June 11 2015), Tweet 2 (id: 18898999, text: CDH5 has been.., time: July 11 2015). Edges: author, follow (since: May 2012), retweet.
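The property-graph model above can be captured in a few lines: vertices with IDs and key-value properties, plus directed, labelled edges that carry their own properties. A minimal sketch using the IDs from the slide (the `out_neighbours` helper is illustrative, not a TinkerPop API):

```python
# Vertices: id -> properties (key-value pairs)
vertices = {
    16134540: {"name": "cloudera", "location": "Palo Alto"},
    18898576: {"name": "gschmutz", "location": "Berne"},
}

# Edges: (out_vertex, label, in_vertex, edge properties)
edges = [
    (18898576, "follow", 16134540, {"since": "May 2012"}),
]

def out_neighbours(vid, label):
    """IDs of vertices reachable from vid over outgoing edges with label."""
    return [dst for src, lbl, dst, _props in edges
            if src == vid and lbl == label]

print(out_neighbours(18898576, "follow"))  # [16134540]
```

Storing edges with their properties (rather than as plain foreign keys) is what lets a graph query ask "followers since 2012" without a join.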
28. TinkerPop 3 Stack
• TinkerPop is a framework composed of various interoperable components
• Vendor independent (similar to JDBC for RDBMS)
• The Core API defines Graph, Vertex, Edge, …
• The Gremlin traversal language is a vendor-independent way to query (traverse) a graph
• The Gremlin server can be leveraged to allow over-the-wire communication with a TinkerPop-enabled graph system
http://tinkerpop.incubator.apache.org/
29. Gremlin Graph Traversal Engine
• Language / system agnostic: many graph languages for many graph systems
• Provided traversal engine: SPARQL or any other graph query language can run on the Gremlin Traversal Machine
• Native distributed execution: a Gremlin traversal over an OLAP graph processor (Hadoop / Spark)
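To give a feel for the traversal style, here is a toy, Gremlin-flavoured chain over an in-memory graph, in the spirit of `g.V(id).out('follow').values('name')`. This is a pure-Python sketch of the fluent pattern, not the actual TinkerPop/Gremlin-Python API, and the class and names are invented for illustration.

```python
class Traversal:
    """Tiny fluent traversal over a dict-based graph."""
    def __init__(self, graph, ids):
        self.graph, self.ids = graph, list(ids)

    def out(self, label):
        # Step to vertices reachable over outgoing edges with this label
        nxt = [dst for src, lbl, dst in self.graph["edges"]
               if lbl == label and src in self.ids]
        return Traversal(self.graph, nxt)

    def values(self, key):
        # Terminal step: collect a property of the current vertices
        return [self.graph["vertices"][v][key] for v in self.ids]

graph = {
    "vertices": {1: {"name": "gschmutz"}, 2: {"name": "cloudera"}},
    "edges": [(1, "follow", 2)],
}
g_V = lambda *ids: Traversal(graph, ids)

print(g_V(1).out("follow").values("name"))  # ['cloudera']
```

A real Gremlin traversal compiles to the same kind of step chain, which is what lets the Gremlin Traversal Machine execute it locally or over an OLAP graph processor.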
33. Summary – Know your domain

[Chart: database families ordered by connectedness of data, low → high]
Key-value stores → wide-column stores → document data stores → relational databases → graph databases