Storm is a distributed, reliable, fault-tolerant system for processing streams of data.
In this track we will introduce Storm framework, explain some design concepts and considerations, and show some real world examples to explain how to use it to process large amounts of data in real time, in a distributed environment. We will describe how we can scale this solution very easily as more data need to be processed.
We will explain all you need to know to get started with Storm and some tips on how to get your Spouts, Bolts and Topologies up and running in the cloud.
10. What is Storm?
● Is a highly distributed realtime computation system.
● Provides general primitives to do realtime computation.
● Can be used with any programming language.
● It's scalable and fault-tolerant.
11. Example - Main Concepts
https://github.com/geisbruch/strata2012
12. Main Concepts - Example
"Count tweets related to strataconf and show hashtag totals."
13. Main Concepts - Spout
● First thing we need is to read all tweets related to strataconf from twitter
● A Spout is a special component that is a source of messages to be processed.
Tweets Tuples
Spout
14. Main Concepts - Spout
public class ApiStreamingSpout extends BaseRichSpout {
SpoutOutputCollector collector;
TwitterReader reader;
public void nextTuple() {
Tweet tweet = reader.getNextTweet();
if(tweet != null)
collector.emit(new Values(tweet));
}
public void open(Map conf, TopologyContext context,
SpoutOutputCollector collector) {
reader = new TwitterReader(conf.get("track"), conf.get("user"), conf.get("pass"));
this.collector = collector;
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("tweet"));
}
}
Tweets Tuples
Spout
15. Main Concepts - Bolt
● For every tweet, we find hashtags
● A Bolt is a component that does the processing.
tweet hashtags
hashtags
stream tweet hashtags
Spout HashtagsSplitter
16. Main Concepts - Bolt
public class HashtagsSplitter extends BaseBasicBolt {
public void execute(Tuple input, BasicOutputCollector collector) {
Tweet tweet = (Tweet)input.getValueByField("tweet");
for(String hashtag : tweet.getHashTags()){
collector.emit(new Values(hashtag));
}
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("hashtag"));
}
}
tweet hashtags
tweet hashtags
hashtags
stream
Spout HashtagsSplitter
17. Main Concepts - Bolt
public class HashtagsCounterBolt extends BaseBasicBolt {
private Jedis jedis;
public void execute(Tuple input, BasicOutputCollector collector) {
String hashtag = input.getStringByField("hashtag");
jedis.hincrBy("hashs", key, val);
}
public void prepare(Map conf, TopologyContext context) {
jedis = new Jedis("localhost");
}
}
tweet
stream tweet hashtag
hashtag
hashtag
Spout HashtagsSplitter HashtagsCounterBolt
18. Main Concepts - Topology
public static Topology createTopology() {
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("tweets-collector", new ApiStreamingSpout(), 1);
builder.setBolt("hashtags-splitter", new HashTagsSplitter(), 2)
.shuffleGrouping("tweets-collector");
builder.setBolt("hashtags-counter", new HashtagsCounterBolt(), 2)
.fieldsGrouping("hashtags-splitter", new Fields("hashtags"));
return builder.createTopology();
}
fields
shuffle
HashtagsSplitter HashtagsCounterBolt
Spout
HashtagsSplitter HashtagsCounterBolt
19. Example Overview - Read Tweet
Spout Splitter Counter Redis
This tweet has #hello
#world hashtags!!!
20. Example Overview - Split Tweet
Spout Splitter Counter Redis
This tweet has #hello
#world hashtags!!! #hello
#world
21. Example Overview - Split Tweet
Spout Splitter Counter Redis
This tweet has #hello
#world hashtags!! #hello
incr(#hello)
#hello = 1
#world
incr(#world)
#world = 1
22. Example Overview - Split Tweet
Spout Splitter Counter Redis
#hello = 1
This tweet has #bye
#world hashtags
#world = 1
23. Example Overview - Split Tweet
Spout Splitter Counter Redis
#hello = 1
#world
#bye
This tweet has #bye
#world hashtags #world = 1
24. Example Overview - Split Tweet
Spout Splitter Counter Redis
#hello = 1
#world
#bye
This tweet has #bye incr(#bye)
#world hashtags #world = 2
incr(#world) #bye =1
25. Main Concepts - In summary
○ Topology
○ Spout
○ Stream
○ Bolt
51. Trident - Query
Stream singleHashtag =
drpcStream.each(new Fields("args"), new Split(), new Fields("hashtag"))
DRPC Hashtag
DRPC stream Splitter
Server
52. Trident - Query
Stream hashtagsCountQuery =
singleHashtag.stateQuery(hashtagCountMemoryState, new Fields("hashtag"),
new MapGet(), new Fields("count"));
DRPC Hashtag Hashtag Count
stream Splitter Query
DRPC
Server
53. Trident - Query
Stream drpcCountAggregate =
hashtagsCountQuery.groupBy(new Fields("hashtag"))
.aggregate(new Fields("count"), new Sum(), new Fields("sum"));
DRPC Hashtag Hashtag Count Aggregate
DRPC stream Splitter Query Count
Server
54. Trident - Query - Complete Example
topology.newDRPCStream("counts",drpc)
.each(new Fields("args"), new Split(), new Fields("hashtag"))
.stateQuery(hashtags, new Fields("hashtag"),
new MapGet(), new Fields("count"))
.each(new Fields("count"), new FilterNull())
.groupBy(new Fields("hashtag"))
.aggregate(new Fields("count"), new Sum(), new Fields("sum"));
55. Trident - Query - Complete Example
DRPC Hashtag Hashtag Count Aggregate
DRPC stream Splitter Query Count
Server
56. Trident - Query - Complete Example
Client Call
DRPC Hashtag Hashtag Count Aggregate
DRPC stream Splitter Query Count
Server
57. Trident - Query - Complete Example
Hold the
Execute
Request
DRPC Hashtag Hashtag Count Aggregate
DRPC stream Splitter Query Count
Server
58. Trident - Query - Complete Example
Request continue
holded
DRPC Hashtag Hashtag Count Aggregate
DRPC stream Splitter Query Count
Server
On finish return to DRPC
server
59. Trident - Query - Complete Example
DRPC Hashtag Hashtag Count Aggregate
DRPC stream Splitter Query Count
Server
Return response to the
client
61. Running Topologies - Local Cluster
Config conf = new Config();
conf.put("some_property","some_value")
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("twitter-hashtag-summarizer", conf, builder.createTopology());
● The whole topology will run in a single machine.
● Similar to run in a production cluster
● Very Useful for test and development
● Configurable parallelism
● Similar methods than StormSubmitter
62. Running Topologies - Production
package twitter.streaming;
public class Topology {
public static void main(String[] args){
StormSubmitter.submitTopology("twitter-hashtag-summarizer",
conf, builder.createTopology());
}
}
mvn package
storm jar Strata2012-0.0.1.jar twitter.streaming.Topology
"$REDIS_HOST" $REDIS_PORT "strata,strataconf"
$TWITTER_USER $TWITTER_PASS
63. Running Topologies - Cluster Layout
● Single Master process Storm Cluster
Supervisor
● No SPOF (where the processing is done)
● Automatic Task distribution Nimbus Supervisor
(The master
(where the processing is done)
Process)
● Amount of task configurable by
process Supervisor
(where the processing is done)
● Automatic tasks re-distribution on
supervisors fails
Zookeeper Cluster
(Only for coordination and cluster state)
● Work continue on nimbus fails
64. Running Topologies - Cluster Architecture - No SPOF
Storm Cluster
Supervisor
(where the processing is done)
Nimbus Supervisor
(The master Process) (where the processing is done)
Supervisor
(where the processing is done)
65. Running Topologies - Cluster Architecture - No SPOF
Storm Cluster
Supervisor
(where the processing is done)
Nimbus Supervisor
(The master Process) (where the processing is done)
Supervisor
(where the processing is done)
66. Running Topologies - Cluster Architecture - No SPOF
Storm Cluster
Supervisor
(where the processing is done)
Nimbus Supervisor
(The master Process) (where the processing is done)
Supervisor
(where the processing is done)
68. Other features - Non JVM languages
{
● Use Non-JVM Languages "conf": {
"topology.message.timeout.secs": 3,
// etc
● Simple protocol },
"context": {
● Use stdin & stdout for communication "task->component": {
"1": "example-spout",
"2": "__acker",
● Many implementations and abstraction
"3": "example-bolt"
layers done by the community },
"taskid": 3
},
"pidDir": "..."
}
69. Contribution
● A collection of community developed spouts, bolts, serializers, and other goodies.
● You can find it in: https://github.com/nathanmarz/storm-contrib
● Some of those goodies are:
○ cassandra: Integrates Storm and Cassandra by writing tuples to a Cassandra Column Family.
○ mongodb: Provides a spout to read documents from mongo, and a bolt to write documents to mongo.
○ SQS: Provides a spout to consume messages from Amazon SQS.
○ hbase: Provides a spout and a bolt to read and write from HBase.
○ kafka: Provides a spout to read messages from kafka.
○ rdbms: Provides a bolt to write tuples to a table.
○ redis-pubsub: Provides a spout to read messages from redis pubsub.
○ And more...