Cassandra provides facilities to integrate with Hadoop. That is sufficient for distributed batch processing, but it does not address distributed complex event processing (CEP). This webinar demonstrates the use of Cassandra with Storm. Storm provides a data flow and processing layer that can be used to integrate Cassandra with other external persistence mechanisms (e.g. Elastic Search) or to calculate dimensional counts for reporting and dashboards. We’ll dive into a sample Storm topology that reads from and writes to Cassandra using storm-cassandra bolts.
C*ollege Credit: CEP Distributed Processing on Cassandra with Storm
1. Brian O’Neill, Lead Architect, Health Market Science
boneill@healthmarketscience.com, @boneill42
Taylor Goetz, Development Lead, Health Market Science
ptgoetz@healthmarketscience.com, @ptgoetz
2. Agenda
Use Case
What is CEP? Why? What for?
Storm
Background
Cluster configuration
Examples / Demo
Future : Trident
3. Our Products
Master Data Management
Good, bad doctors?
Prescriber eligibility and remediation.
4. Cassandra to the Rescue
1000’s of Feeds
C* Masterfile
Big Data for us == Variety of Data
Δt
6. What might that look like?
wide-row index
C* I’m happy
RDBMS
Provide for Polyglot Persistence
7. What we did wrong… (part 1)
Could not react to transactional changes
Needed extra logic to track what changed
Took too long
8. What we did wrong… (part 2)
Wide Row
….
C*
Indexing
TRIGGERS
Good Intentions
Take the onus off the clients
Bad Result
Guaranteeing execution
Write Overhead
9. What is Complex Event Processing?
Event processing is a method of tracking and analyzing (processing) streams of information (data) about things that happen (events), and deriving a conclusion from them.
Complex event processing, or CEP, is event processing that combines data from multiple sources to infer events or patterns that suggest more complicated circumstances.
http://en.wikipedia.org/wiki/Complex_event_processing
10. Imagine…
Treating CRUD operations as events in a
system.
Then Suddenly,
CEP = (ETL or Analytics)
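To make the idea concrete, here is a minimal plain-Java sketch of treating CRUD operations as events and deriving a dimensional count from them. The CrudEvent and DimensionalCounter types, and the choice of a "dimension" field, are hypothetical names for illustration only; they are not part of Storm, Cassandra, or the speakers' system.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical event type: one CRUD operation against a record, tagged with a
// reporting dimension (for example, a US state). Not a Storm or Cassandra class.
final class CrudEvent {
    enum Op { CREATE, UPDATE, DELETE }

    final Op op;
    final String rowKey;
    final String dimension;

    CrudEvent(Op op, String rowKey, String dimension) {
        this.op = op;
        this.rowKey = rowKey;
        this.dimension = dimension;
    }
}

// Keeps a running count of live records per dimension as events stream in.
final class DimensionalCounter {
    private final Map<String, Long> counts = new HashMap<>();

    void onEvent(CrudEvent e) {
        switch (e.op) {
            case CREATE:
                counts.merge(e.dimension, 1L, Long::sum);
                break;
            case DELETE:
                counts.merge(e.dimension, -1L, Long::sum);
                break;
            default:
                break; // UPDATE leaves the count unchanged in this sketch
        }
    }

    long count(String dimension) {
        return counts.getOrDefault(dimension, 0L);
    }
}
```

Each event updates the count as it arrives, so a dashboard can read the current value without rescanning the source data.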
11. What Storm is to us…
[Diagram labels: CRUD Op, ETL, Enrichment, Dimensional Counts, Fuzzy Index, SoR, RDBMS]
A High Throughput Data Processing Pipeline
13. Storm Overview
Open-Sourced by Twitter in 2011
Distributed Realtime Computation System
Fault Tolerant
Highly Scalable
Guaranteed Processing
Operates on one or more streams of data (i.e. CEP)
15. Storm Components
Spouts
Stream Sources
Bolts
Unit of Computation
Topologies
Combination of n Spouts and m Bolts
Defines the overall “Computation”
16. Storm Spouts
Represents a source (stream) of data
Queues (JMS, Kafka, Kestrel, etc.)
Twitter Firehose
Sensor Data
Emits “Tuples” (Events) based on source
Primary Storm data structure
Set of Key-Value pairs
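As a rough illustration of a spout pulling from a source and emitting tuples, here is a plain-Java stand-in. It deliberately does not use Storm's spout interface; QueueSpoutSketch, its methods, and the "sentence" field name are assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Plain-Java stand-in for a spout. This is NOT Storm's spout interface,
// only the idea: read from a source, emit one tuple per item.
final class QueueSpoutSketch {
    private final Queue<String> source = new ArrayDeque<>();

    // Simulates a message arriving on the underlying queue.
    void offer(String sentence) {
        source.add(sentence);
    }

    // Returns the next tuple as field-name -> value pairs,
    // or null when the source has nothing to emit.
    Map<String, Object> nextTuple() {
        String sentence = source.poll();
        if (sentence == null) {
            return null;
        }
        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("sentence", sentence);
        return tuple;
    }
}
```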
17. Storm Bolts
Receive Tuples from Spouts or other Bolts
Operate on, or React to Data
Functions/Filters/Joins/Aggregations
Database writes/lookups
Optionally emit additional Tuples
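The receive, operate, and optionally emit cycle can be sketched in plain Java as follows. SplitBoltSketch is a hypothetical class, not Storm's bolt API; it models a bolt as a function from one input tuple to zero or more output tuples.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for a bolt (not Storm's bolt API).
final class SplitBoltSketch {
    // Receives one tuple with a "sentence" field and emits one tuple per word.
    // Blank input emits nothing, which is how a filter behaves.
    static List<Map<String, Object>> execute(Map<String, Object> input) {
        List<Map<String, Object>> emitted = new ArrayList<>();
        String sentence = (String) input.get("sentence");
        if (sentence == null) {
            return emitted;
        }
        for (String word : sentence.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                emitted.add(Collections.<String, Object>singletonMap("word", word));
            }
        }
        return emitted;
    }
}
```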
18. Storm Topologies
Data flow between spouts and bolts
Routing of Tuples between spouts/bolts
Stream “Groupings”
Parallelism of Components
Long-Lived
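One common stream grouping routes tuples by the value of a field, so that equal values always reach the same task of a bolt. A minimal sketch of that routing idea, assuming a simple hash-modulo scheme (Storm's actual implementation may differ, and FieldsGroupingSketch is a hypothetical name):

```java
// Sketch of field-based routing: equal field values map to the same task.
final class FieldsGroupingSketch {
    // Picks a task index in [0, parallelism) from the grouping field's value.
    static int chooseTask(Object fieldValue, int parallelism) {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be >= 1");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(fieldValue.hashCode(), parallelism);
    }
}
```

Because the task is a pure function of the field value, per-key work such as counting can be kept local to one task.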
19. Storm and Cassandra
Use Cases:
Write Storm Tuple data to C*
○ Computation Results
○ Pre-computed indices
Read data from C* and emit Storm Tuples
○ Dynamic Lookups
http://github.com/hmsonline/storm-cassandra
20. Storm Cassandra Bolt Types
[Diagram labels: CassandraBolt, CassandraLookupBolt, C*]
CassandraBolt
Writes data to Cassandra
Available in Batching and Non-Batching
CassandraLookupBolt
Reads data from Cassandra
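The slide does not say how the batching variant works internally. As one plausible reading, a batching writer buffers items and hands them to the store in groups. The sketch below is an assumption-laden illustration in plain Java; BatchingWriterSketch and its sink callback are hypothetical and are not the storm-cassandra API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical batching writer: buffers items and flushes them in groups.
final class BatchingWriterSketch<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // stands in for a batched database write
    private final List<T> buffer = new ArrayList<>();

    BatchingWriterSketch(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Buffers one item and flushes when the buffer reaches batchSize.
    void write(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Sends any buffered items to the sink as one batch.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

A non-batching writer is the special case batchSize = 1: every write goes to the sink immediately.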
22. Storm-Cassandra Project
TupleMapper Interface
Tells the CassandraBolt how to write a tuple to an arbitrary data model
Given a Storm Tuple:
Map to Column Family
Map to Row Key
Map to Columns
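The three mappings above can be sketched as an interface with one method per mapping. This is a simplified, hypothetical model that represents a tuple as a Map; the names TupleMapperSketch and KeyFieldMapper are made up for illustration, and the project's real interface and signatures should be checked in the repository.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interface modeled on the three mappings listed on the slide.
interface TupleMapperSketch {
    String mapToColumnFamily(Map<String, Object> tuple);
    String mapToRowKey(Map<String, Object> tuple);
    Map<String, String> mapToColumns(Map<String, Object> tuple);
}

// Example mapper: fixed column family, row key taken from one field,
// and every other field written as a column.
final class KeyFieldMapper implements TupleMapperSketch {
    private final String columnFamily;
    private final String rowKeyField;

    KeyFieldMapper(String columnFamily, String rowKeyField) {
        this.columnFamily = columnFamily;
        this.rowKeyField = rowKeyField;
    }

    public String mapToColumnFamily(Map<String, Object> tuple) {
        return columnFamily;
    }

    public String mapToRowKey(Map<String, Object> tuple) {
        return String.valueOf(tuple.get(rowKeyField));
    }

    public Map<String, String> mapToColumns(Map<String, Object> tuple) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : tuple.entrySet()) {
            if (!e.getKey().equals(rowKeyField)) {
                columns.put(e.getKey(), String.valueOf(e.getValue()));
            }
        }
        return columns;
    }
}
```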
23. Storm-Cassandra Project
ColumnsMapper Interface
Tells the CassandraLookupBolt how to transform a C* row into a Storm Tuple
Given a C* Row Key and list of Columns:
Return a list of Storm Tuples
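The read direction can be sketched the same way: given a row key and its columns, produce a list of tuples. The interface below is a hypothetical simplification (tuples are plain lists of values, columns are a String map); ColumnsMapperSketch and OneTuplePerColumnMapper are illustrative names, not the project's actual types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical interface: turn one row into zero or more tuples.
interface ColumnsMapperSketch {
    List<List<Object>> mapToValues(String rowKey, Map<String, String> columns);
}

// Example mapper: emits one (rowKey, columnName, columnValue) tuple per column.
final class OneTuplePerColumnMapper implements ColumnsMapperSketch {
    public List<List<Object>> mapToValues(String rowKey, Map<String, String> columns) {
        List<List<Object>> tuples = new ArrayList<>();
        for (Map.Entry<String, String> column : columns.entrySet()) {
            List<Object> values = new ArrayList<>();
            values.add(rowKey);
            values.add(column.getKey());
            values.add(column.getValue());
            tuples.add(values);
        }
        return tuples;
    }
}
```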
24. Storm-Cassandra Project
Current State:
Version 0.4.0-WIP
Uses Astyanax Client
Several out-of-the-box *Mapper Implementations:
○ Basic Key-Value Columns
○ Value-less Columns
○ Counter Columns
○ Lookup by row key
○ Lookup by range query
Initial pass at Trident support
Initial pass at Composite Column Support
25. Storm-Cassandra Project
Future Plans:
Switch to CQL (???)
Full Trident Support
30. Trident
Provides a higher-level abstraction for stream processing
Constructs for state management and batching
Adds additional primitives that abstract away common topological patterns
Deprecates transactional topologies
Distributed with Storm
32. A sample topology
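    // spout and serverLocations are assumed to be defined elsewhere.
    TridentTopology topology = new TridentTopology();
    TridentState wordCounts =
        topology.newStream("spout1", spout)
            // Split each sentence into words.
            .each(new Fields("sentence"), new Split(), new Fields("word"))
            // Group tuples by word.
            .groupBy(new Fields("word"))
            // Keep a running count per word in Memcached, using opaque state.
            .persistentAggregate(
                MemcachedState.opaque(serverLocations),
                new Count(),
                new Fields("count"))
            .parallelismHint(6);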
https://github.com/nathanmarz/storm/wiki/Trident-state
33. Trident State
Sequenced writes by batch/transaction id.
Spouts
Transactional
○ Batch contents never change
Opaque
○ Batch contents can change
State
Transactional
○ Store tx_id with counts to maintain sequencing of writes.
Opaque
○ Store previous value in order to overwrite the current value
when contents of a batch change.
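The two state strategies can be sketched in plain Java for a simple counter. This is a simplified model of the update rules described on this slide, not Trident's State API; the class names are hypothetical, and the sketch assumes transaction ids are non-negative and arrive in order.

```java
import java.util.HashMap;
import java.util.Map;

// Transactional state: the value is stored together with the txid that last updated it.
final class TransactionalCountSketch {
    private static final class Entry { long txid = -1; long count; }
    private final Map<String, Entry> store = new HashMap<>();

    // Applies a batch's partial count. A replay of the same txid is skipped,
    // which is safe only because batch contents never change.
    void apply(String key, long txid, long delta) {
        Entry e = store.computeIfAbsent(key, k -> new Entry());
        if (e.txid == txid) {
            return; // this batch was already applied
        }
        e.count += delta;
        e.txid = txid;
    }

    long get(String key) {
        Entry e = store.get(key);
        return e == null ? 0L : e.count;
    }
}

// Opaque state: also stores the previous value so a replayed batch
// with different contents can overwrite the current value.
final class OpaqueCountSketch {
    private static final class Entry { long txid = -1; long prev; long curr; }
    private final Map<String, Entry> store = new HashMap<>();

    void apply(String key, long txid, long delta) {
        Entry e = store.computeIfAbsent(key, k -> new Entry());
        if (e.txid == txid) {
            // Same txid seen again: recompute from the previous value.
            e.curr = e.prev + delta;
        } else {
            e.prev = e.curr;
            e.curr = e.curr + delta;
            e.txid = txid;
        }
    }

    long get(String key) {
        Entry e = store.get(key);
        return e == null ? 0L : e.curr;
    }
}
```

In both cases the stored txid is what lets a replayed batch be applied exactly once to the persisted count.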
35. Brian O’Neill, Lead Architect, Health Market Science
boneill@healthmarketscience.com, @boneill42
Taylor Goetz, Development Lead, Health Market Science
ptgoetz@healthmarketscience.com, @ptgoetz