Storm is a fast, scalable, fault-tolerant, and easy to operate distributed realtime computation system. It guarantees that messages will be processed and allows processing big data streams reliably in real time. Storm was originally developed by Nathan Marz at BackType (acquired by Twitter) and is written in Java and Clojure. It uses a simple programming model and can scale to large clusters, making it suitable for processing millions of events per second.
2. About Me
Eiichiro Uchiumi
• A solutions architect at
working in emerging enterprise
technologies
- Cloud transformation
- Enterprise mobility
- Information optimization (big data)
https://github.com/eiichiro
@eiichirouchiumi
http://www.facebook.com/eiichiro.uchiumi
3. What is Stream Processing?
Stream processing is a paradigm for processing
large-volume, unbounded sequences of tuples in real time
• Algorithmic trading
• Sensor data monitoring
• Continuous analytics
Figure: a source emits a stream that flows into a stream processor.
4. What is Storm?
Storm is a distributed realtime computation system that is
• Fast & scalable
• Fault-tolerant
• Guaranteed to process messages
• Easy to set up & operate
• Free & open source
- Originally developed by Nathan Marz at BackType (acquired by Twitter)
- Written in Java and Clojure
5. Conceptual View
Figure: a topology in which spouts emit tuples that flow along streams through a graph of bolts.
Spout:
Source of streams
Bolt:
Consumer of streams; does some processing
and possibly emits new tuples
Stream:
Unbounded sequence of tuples
Tuple:
List of name-value pairs
Topology:
Graph of computation composed of spouts and bolts as the nodes and streams as the edges
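The concepts above can be modeled in a few lines of plain Java. This is only a sketch for intuition: `MiniTopology`, `Bolt`, `SPLIT`, `tuple` and `run` are hypothetical names invented here, not Storm classes, and a finite list stands in for the unbounded stream a real spout would produce.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of spout, bolt, stream and tuple; none of these are Storm types.
final class MiniTopology {
    // A tuple modeled as an ordered list of name-value pairs.
    static Map<String, Object> tuple(String name, Object value) {
        Map<String, Object> t = new LinkedHashMap<>();
        t.put(name, value);
        return t;
    }

    // Bolt: consumes one tuple and possibly emits new tuples.
    interface Bolt {
        List<Map<String, Object>> execute(Map<String, Object> input);
    }

    // Splits a "sentence" tuple into one "word" tuple per word.
    static final Bolt SPLIT = input -> {
        List<Map<String, Object>> out = new ArrayList<>();
        for (String w : ((String) input.get("sentence")).split("\\s+")) {
            out.add(tuple("word", w));
        }
        return out;
    };

    // Feeds each sentence (the stand-in spout output) through the split bolt,
    // then counts the resulting word tuples (the stand-in counting bolt).
    static Map<String, Integer> run(List<String> sentences) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String s : sentences) {
            for (Map<String, Object> word : SPLIT.execute(tuple("sentence", s))) {
                counts.merge((String) word.get("word"), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Calling `MiniTopology.run` with the single sentence "the cow jumped over the moon" yields a count of 2 for "the" and 1 for each other word. In Storm the same graph would run continuously and in parallel across many machines.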
6. Physical View
Figure: Nimbus, a ZooKeeper cluster, and several Supervisors; each Supervisor (worker node) hosts N workers (worker processes), each worker hosts N executors, and each executor hosts N tasks.
Nimbus:
Master daemon process responsible for
• distributing code
• assigning tasks
• monitoring failures
ZooKeeper:
Stores cluster operational state
Supervisor:
Worker daemon process listening for work assigned to its node
Worker:
Java process that executes a subset of a topology
Executor:
Java thread spawned by a worker; runs one or more tasks of the same component
Task:
Component (spout/bolt) instance that performs the actual data processing
7. Spout
import java.util.Map;
import java.util.Random;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class RandomSentenceSpout extends BaseRichSpout {
    SpoutOutputCollector collector;
    Random random;

    // Called once when the spout is initialized; keeps the collector for later emits.
    @Override
    public void open(Map conf, TopologyContext context,
            SpoutOutputCollector collector) {
        this.collector = collector;
        random = new Random();
    }

    // Called repeatedly; emits one randomly chosen sentence per call.
    @Override
    public void nextTuple() {
        String[] sentences = new String[] {
                "the cow jumped over the moon",
                "an apple a day keeps the doctor away",
                "four score and seven years ago",
                "snow white and the seven dwarfs",
                "i am at two with nature"
        };
        String sentence = sentences[random.nextInt(sentences.length)];
        collector.emit(new Values(sentence));
    }
8. Spout
    // Declares that this spout emits tuples with a single field named "sentence".
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }

    @Override
    public void ack(Object msgId) {}

    @Override
    public void fail(Object msgId) {}
}
11. Starting Topology
Figure: StormSubmitter, Nimbus (Thrift server), and ZooKeeper. The topology is submitted with:
> bin/storm jar
• Uploads topology JAR to Nimbus’ inbox with dependencies
• Submits topology configuration as JSON and structure as Thrift
• Copies topology JAR, configuration and structure into local file system
• Sets up static information for topology
• Makes assignment
• Starts topology
15. Parallelism
• RandomSentenceSpout: parallelism hint = 2; number of tasks = not specified = same as parallelism hint = 2
• SplitSentenceBolt: parallelism hint = 4; number of tasks = 8
• WordCountBolt: parallelism hint = 6; number of tasks = not specified = same as parallelism hint = 6
• Number of topology workers = 4
• Number of worker slots / node = 4
• Number of worker nodes = 2
• Number of executor threads = 2 + 4 + 6 = 12
• Number of component instances = 2 + 8 + 6 = 16
Figure: two worker nodes running worker processes, whose executor threads host the RandomSentenceSpout (RS Spout), SplitSentenceBolt (SS Bolt), and WordCountBolt (WC Bolt) instances.
Topology can be spread out manually without downtime
when a worker node is added
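The executor and task counts on this slide follow from two simple rules: each component gets as many executors as its parallelism hint, and as many tasks as specified, falling back to the hint when unspecified. The sketch below reproduces the arithmetic; `ParallelismMath`, `UNSPECIFIED` and `WORD_COUNT` are hypothetical names used only for illustration.

```java
// Reproduces the executor/task arithmetic of the slide; not part of Storm.
final class ParallelismMath {
    // Marker for "number of tasks not specified".
    static final int UNSPECIFIED = 0;

    // Each row is {parallelism hint, number of tasks} for one component:
    // RandomSentenceSpout, SplitSentenceBolt, WordCountBolt.
    static final int[][] WORD_COUNT = {
        {2, UNSPECIFIED},
        {4, 8},
        {6, UNSPECIFIED},
    };

    // An unspecified task count defaults to the parallelism hint.
    static int effectiveTasks(int parallelismHint, int numTasks) {
        return numTasks == UNSPECIFIED ? parallelismHint : numTasks;
    }

    // Executor threads = sum of parallelism hints.
    static int totalExecutors(int[][] components) {
        int sum = 0;
        for (int[] c : components) {
            sum += c[0];
        }
        return sum;
    }

    // Component instances (tasks) = sum of effective task counts.
    static int totalTasks(int[][] components) {
        int sum = 0;
        for (int[] c : components) {
            sum += effectiveTasks(c[0], c[1]);
        }
        return sum;
    }
}
```

For `WORD_COUNT` this gives 12 executors and 16 tasks, matching the slide. If the 12 executors were spread evenly over the 4 topology workers (an assumption about scheduling, not something the slide states), each worker process would run 3 executor threads.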
16. Message Passing
Figure: a worker process containing several executors, a receive thread (taking tuples from other workers), a transfer thread (sending tuples to other workers), and the receiver queue, transfer queue, and internal transfer queue between them.
Interprocess communication is mediated by ZeroMQ
Outside transfer is done with Kryo serialization
Local communication is mediated by LMAX Disruptor
Inside transfer is done with no serialization
17. LMAX Disruptor
Large concurrent magic ring buffer that can be used like a blocking queue
• Consumer can easily keep up with producer by batching
• CPU cache friendly
- The ring is implemented as an array, so the entries can be preloaded
• GC safe
- The entries are preallocated up front and live forever
Figure: a producer and a consumer working on the ring buffer.
6 million orders per second can be processed on a single thread at LMAX
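To make the three bullet points concrete, here is a deliberately simplified ring buffer. It is not the LMAX Disruptor API and is not thread-safe; `MiniRing`, `offer` and `drain` are hypothetical names. It only illustrates the array-backed ring, the preallocated slots, and a consumer that catches up by draining everything published so far in one batch.

```java
import java.util.function.LongConsumer;

// Simplified, single-threaded illustration of the ring-buffer idea; not the Disruptor.
final class MiniRing {
    private final long[] slots; // allocated once up front and reused forever
    private final int mask;     // capacity - 1, valid because capacity is a power of two
    private long published;     // number of values written by the producer
    private long consumed;      // number of values handed to the consumer

    MiniRing(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new long[capacity];
        mask = capacity - 1;
    }

    // Producer side: writes into the next slot, or returns false when the ring is full.
    boolean offer(long value) {
        if (published - consumed == slots.length) {
            return false;
        }
        slots[(int) (published & mask)] = value;
        published++;
        return true;
    }

    // Consumer side: handles every value published so far as one batch
    // and returns the batch size.
    int drain(LongConsumer handler) {
        int n = 0;
        while (consumed < published) {
            handler.accept(slots[(int) (consumed & mask)]);
            consumed++;
            n++;
        }
        return n;
    }
}
```

Because the slots are a contiguous array that is never reallocated, the consumer walks memory sequentially and no garbage is created per message. The real Disruptor adds the memory-ordering and wait strategies needed for concurrent producers and consumers, which this sketch omits.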
19. Fault-tolerance
Cluster works normally
Figure: interactions among Nimbus, Supervisor, Worker, and ZooKeeper:
• Monitoring cluster state
• Synchronizing assignment
• Sending heartbeat
• Reading worker heartbeat from local file system
• Sending executor heartbeat
20. Fault-tolerance
Nimbus goes down
Figure: the cluster diagram from slide 19, with Nimbus down.
Processing will still continue, but topology lifecycle operations
and the reassignment facility are lost
21. Fault-tolerance
Worker node goes down
Figure: the cluster diagram from slide 19, with one worker node (Supervisor and Worker) down.
Nimbus will reassign the tasks to other machines
and the processing will continue
22. Fault-tolerance
Supervisor goes down
Figure: the cluster diagram from slide 19, with the Supervisor down.
Processing will still continue, but assignments are
no longer synchronized
23. Fault-tolerance
Worker process goes down
Figure: the cluster diagram from slide 19, with the worker process down.
Supervisor will restart the worker process
and the processing will continue
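All of the recovery scenarios above depend on noticing that heartbeats have stopped. The slides do not spell out the detection rule, so the following is a generic timeout-based sketch under that assumption; `HeartbeatMonitor`, its methods, and the timeout value are hypothetical and do not come from Storm's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Generic timeout-based failure detector; an illustration, not Storm's implementation.
final class HeartbeatMonitor {
    private final long timeoutMillis; // assumed threshold, chosen by the caller
    private final Map<String, Long> lastBeat = new TreeMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Records that the given process reported in at time nowMillis.
    void heartbeat(String processId, long nowMillis) {
        lastBeat.put(processId, nowMillis);
    }

    // Returns the processes whose last heartbeat is older than the timeout,
    // i.e. the candidates for restart or reassignment, sorted by id.
    List<String> timedOut(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastBeat.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        return dead;
    }
}
```

In terms of the slides, a Supervisor applying such a check to worker heartbeats would restart the worker process, while Nimbus applying it to the state in ZooKeeper would reassign tasks to other machines.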
25. Reliability API
public class RandomSentenceSpout extends BaseRichSpout {
    public void nextTuple() {
        ...;
        UUID msgId = getMsgId();
        // Emitting tuple with message id
        collector.emit(new Values(sentence), msgId);
    }

    public void ack(Object msgId) {
        // Do something with acked message id.
    }

    public void fail(Object msgId) {
        // Do something with failed message id.
    }
}

public class SplitSentenceBolt extends BaseRichBolt {
    public void execute(Tuple input) {
        for (String s : input.getString(0).split("\\s")) {
            // Anchoring incoming tuple to outgoing tuples
            collector.emit(input, new Values(s));
        }

        // Sending ack
        collector.ack(input);
    }
}
Figure: tuple tree. The spout tuple "the cow jumped over the moon" is the root, and the word tuples "the", "cow", "jumped", "over", "the", "moon" are its children.
26. Acking Framework
Figure: RandomSentenceSpout, SplitSentenceBolt, and WordCountBolt pass tuples A, B, and C along the topology and send "Acker init", "Acker ack", and "Acker fail" messages to the acker implicit bolt.
Acker implicit bolt entry: Spout tuple id, Spout task id, and a 64-bit number called “Ack val”
• Emitted tuple A, XOR tuple A id with ack val
• Emitted tuple B, XOR tuple B id with ack val
• Emitted tuple C, XOR tuple C id with ack val
• Acked tuple A, XOR tuple A id with ack val
• Acked tuple B, XOR tuple B id with ack val
• Acked tuple C, XOR tuple C id with ack val
When the ack val has become 0, the acker implicit bolt knows
the tuple tree has been completed
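The XOR bookkeeping above works because XOR-ing the same id twice cancels it out, so the order of emits and acks does not matter. The sketch below shows just that arithmetic for a single spout tuple; `AckTracker` and its methods are hypothetical names, and the small tuple ids are chosen for readability rather than being random 64-bit values.

```java
// Tracks one tuple tree with a single 64-bit value; an illustration, not Storm's acker.
final class AckTracker {
    private long ackVal; // starts at 0

    // Called when a tuple in the tree is emitted: XOR its id into the ack val.
    void emitted(long tupleId) {
        ackVal ^= tupleId;
    }

    // Called when a tuple in the tree is acked: XOR its id into the ack val again.
    void acked(long tupleId) {
        ackVal ^= tupleId;
    }

    // The tree is complete once every emitted id has also been acked.
    boolean treeCompleted() {
        return ackVal == 0;
    }

    long ackVal() {
        return ackVal;
    }
}
```

With ids A = 0x5, B = 0x9 and C = 0x3, the sequence emit A, emit B, ack A, emit C, ack C, ack B leaves the ack val at 0 only after the final ack. This is why a fixed amount of state per spout tuple is enough, no matter how large the tuple tree grows.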
28. Cluster Setup
• Set up a ZooKeeper cluster
• Install dependencies on Nimbus and worker machines
- ZeroMQ 2.1.7 and JZMQ
- Java 6 and Python 2.6.6
- unzip
• Download and extract a Storm release to Nimbus and worker machines
• Fill in mandatory configuration in storm.yaml
• Launch daemons under supervision using the “storm” script
33. Basic Resources
• Storm is available at
- http://storm-project.net/
- https://github.com/nathanmarz/storm
under Eclipse Public License 1.0
• Get help on
- http://groups.google.com/group/storm-user
- #storm-user freenode room
• Follow
- @stormprocessor and @nathanmarz
for updates on the project
34. Many Contributions
• Community repository for modules to use Storm at
- https://github.com/nathanmarz/storm-contrib
including integration with Redis, Kafka, MongoDB,
HBase, JMS, Amazon SQS and so on
• Good articles for understanding Storm internals
- http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
- http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
• Good slides for understanding real-life examples
- http://www.slideshare.net/DanLynn1/storm-as-deep-into-realtime-data-processing-as-you-can-get-in-30-minutes
- http://www.slideshare.net/KrishnaGade2/storm-at-twitter
35. Features on Deck
• Current release: 0.8.2 as of 6/28/2013
• Work in progress (older): 0.8.3-wip3
- Some bug fixes
• Work in progress (newest): 0.9.0-wip19
- SLF4J and Logback
- Pluggable tuple serialization and blowfish encryption
- Pluggable interprocess messaging and Netty implementation
- Some bug fixes
- And more
36. Advanced Topics
• Distributed RPC
• Transactional topologies
• Trident
• Using non-JVM languages with Storm
• Unit testing
• Patterns
...These are not covered in this presentation, so check
them out yourself, or in my upcoming session if I get
the chance :)