4. Background
• Created by Nathan Marz @ BackType/Twitter
– Analyzes tweets, links, and users on Twitter
• Open-sourced in Sep 2011
– Eclipse Public License 1.0
– Storm 0.5.2
– 16k Java and 7k Clojure LOC
– Current stable release 0.8.2
• 0.9.0 major core improvement
5. Background
• Active user group
– https://groups.google.com/group/storm-user
– https://github.com/nathanmarz/storm
– Most watched Java repo on GitHub (>4k watchers)
– Used by over 30 companies
• Twitter, Groupon, Alibaba, GumGum, ..
8. Problems
• Scaling is painful
• Poor fault-tolerance
– Hadoop is stateful
• Coding is tedious
• Batch processing
– Long latency
– No real-time processing
9. Storm
• Scalable and robust
– No persistence layer
• Guarantees no data loss
• Fault-tolerant
• Programming language agnostic
• Use cases
– Stream processing
– Distributed RPC
– Continuous computation
21. Tasks
• Instances of Spouts and Bolts
• Managed by Supervisor
– http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
22. Stream grouping
• All grouping
– Send to all tasks
• Global grouping
– Pick task with lowest id
• Shuffle grouping
– Pick a random task
• Fields grouping
– Hash on a subset of tuple fields, so tuples with equal field values always go to the same task
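The four groupings can be sketched as plain functions that pick target tasks for a tuple. This is a minimal illustration, not Storm's implementation: the function names, the dict-based tuple, and the CRC32-modulo hash in fields_grouping are assumptions made for the example.

```python
import random
import zlib


def all_grouping(task_ids, tup):
    """All grouping: every task receives a copy of the tuple."""
    return list(task_ids)


def global_grouping(task_ids, tup):
    """Global grouping: the whole stream goes to the task with the lowest id."""
    return [min(task_ids)]


def shuffle_grouping(task_ids, tup, rng=random):
    """Shuffle grouping: pick one task at random."""
    return [rng.choice(task_ids)]


def fields_grouping(task_ids, tup, fields):
    """Fields grouping: hash the selected fields so equal values share a task.

    CRC32 modulo the task count is an assumption used for illustration only.
    """
    key = "|".join(str(tup[f]) for f in fields)
    index = zlib.crc32(key.encode("utf-8")) % len(task_ids)
    return [task_ids[index]]


if __name__ == "__main__":
    tasks = [3, 1, 2]
    print(fields_grouping(tasks, {"word": "storm", "count": 1}, ["word"]))
    print(fields_grouping(tasks, {"word": "storm", "count": 2}, ["word"]))
```

Both calls in the usage example target the same task, because only the "word" field takes part in the hash.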
23. Storm fault-tolerance
• Reliability API
– Spout tuple creation
• collector.emit(values, msgID);
– Child tuple creation (Bolts)
• collector.emit(parentTuples, values);
– Tuple end of processing
• collector.ack(tuple);
– Tuple failed to process
• collector.fail(tuple);
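To show how the four calls above fit together, here is a toy tracker for the tuple tree rooted at each spout tuple. It is a sketch under assumptions, not how Storm tracks tuples internally: the class TupleTreeTracker, its method names, and the string tuple ids are all hypothetical.

```python
class TupleTreeTracker:
    """Toy bookkeeping for tuple trees; a hypothetical stand-in for illustration."""

    def __init__(self):
        self.pending = {}    # msg_id -> ids of tuples in the tree not yet acked
        self.roots_of = {}   # tuple id -> set of msg_ids it descends from
        self.completed = []  # msg_ids whose whole tree was acked
        self.failed = []     # msg_ids with at least one failed tuple

    def spout_emit(self, tuple_id, msg_id):
        # Mirrors collector.emit(values, msgID): starts a new tracked tree.
        self.pending[msg_id] = {tuple_id}
        self.roots_of[tuple_id] = {msg_id}

    def bolt_emit(self, parent_ids, tuple_id):
        # Mirrors collector.emit(parentTuples, values): the child is anchored
        # to its parents and joins every tree they belong to.
        roots = set()
        for parent in parent_ids:
            roots |= self.roots_of.get(parent, set())
        self.roots_of[tuple_id] = roots
        for msg_id in roots:
            if msg_id in self.pending:
                self.pending[msg_id].add(tuple_id)

    def ack(self, tuple_id):
        # Mirrors collector.ack(tuple): a tree completes once nothing is pending.
        for msg_id in self.roots_of.get(tuple_id, set()):
            outstanding = self.pending.get(msg_id)
            if outstanding is None:
                continue
            outstanding.discard(tuple_id)
            if not outstanding:
                del self.pending[msg_id]
                self.completed.append(msg_id)

    def fail(self, tuple_id):
        # Mirrors collector.fail(tuple): the whole spout tuple is marked failed.
        for msg_id in self.roots_of.get(tuple_id, set()):
            if self.pending.pop(msg_id, None) is not None:
                self.failed.append(msg_id)
```

A spout tuple counts as fully processed only after every tuple anchored to it has been acked; a single fail marks the whole tree as failed so the spout can replay it.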
24. Storm fault-tolerance
• Disable reliability API
– Globally
• Config.TOPOLOGY_ACKER_EXECUTORS = 0
– Per spout tuple (omit the msgID)
• collector.emit(values);
– For a single downstream tuple (omit parentTuples, i.e. emit unanchored)
• collector.emit(values);
28. Multilang protocol
• Using ShellSpout/ShellBolt
• Processes communicate over standard input/output
• Messages are encoded as JSON / lines of plain text
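A minimal sketch of the message framing, assuming the multilang convention that each message is a JSON document followed by a line containing only "end". The helper names write_msg and read_msg are chosen for this example, and io.StringIO stands in for the real standard input/output pipes.

```python
import io
import json


def write_msg(stream, msg):
    """Serialize one message as JSON and terminate it with an 'end' line."""
    stream.write(json.dumps(msg))
    stream.write("\nend\n")
    stream.flush()


def read_msg(stream):
    """Read lines until the 'end' marker, then decode the collected JSON."""
    lines = []
    while True:
        line = stream.readline()
        if not line:
            raise EOFError("stream closed before the 'end' line")
        if line.strip() == "end":
            break
        lines.append(line)
    return json.loads("".join(lines))


if __name__ == "__main__":
    pipe = io.StringIO()  # stand-in for a stdin/stdout pipe
    write_msg(pipe, {"command": "emit", "tuple": ["hello", 1]})
    pipe.seek(0)
    print(read_msg(pipe))
```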
29. Three steps
• Initiate a handshake
– Storm keeps track of the process by its process ID
– Storm sends a JSON object to the process's standard input at startup
– Contains
• Storm configuration, topology context, PID directory
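The handshake can be sketched from the child process's side as follows. The sketch assumes that the initial JSON object uses the keys "conf", "context", and "pidDir", that the process answers by creating an empty file named after its PID in that directory and sending its PID back, and that messages end with an "end" line. The function names and the in-memory streams are hypothetical.

```python
import io
import json
import os
import tempfile


def read_json_until_end(stream):
    """Collect lines up to the 'end' marker and decode them as JSON."""
    lines = []
    for line in iter(stream.readline, ""):
        if line.strip() == "end":
            break
        lines.append(line)
    return json.loads("".join(lines))


def handshake(in_stream, out_stream):
    """Child side of the handshake: read the setup info, then report the PID."""
    setup = read_json_until_end(in_stream)
    pid = os.getpid()
    # Leave an empty file named after the PID so the parent can track the process.
    open(os.path.join(setup["pidDir"], str(pid)), "w").close()
    out_stream.write(json.dumps({"pid": pid}) + "\nend\n")
    out_stream.flush()
    return setup["conf"], setup["context"]


def demo_handshake():
    """Run the handshake against in-memory streams and a temporary PID directory."""
    with tempfile.TemporaryDirectory() as pid_dir:
        setup = {"conf": {"topology.debug": False},
                 "context": {"taskid": 3},
                 "pidDir": pid_dir}
        fake_stdin = io.StringIO(json.dumps(setup) + "\nend\n")
        fake_stdout = io.StringIO()
        conf, context = handshake(fake_stdin, fake_stdout)
        pid_file_exists = os.path.exists(os.path.join(pid_dir, str(os.getpid())))
        return conf, context, pid_file_exists, fake_stdout.getvalue()
```

In a real component, in_stream and out_stream would be sys.stdin and sys.stdout.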
30. Three steps
• Start looping
– storm_sync would expect storm_ack
• Read or write tuples
– Follow defined structure
– Implement read_msg(), storm_emit(), …
32. Experience
• Not hard to set up, but
– Beware of certain versions of ZooKeeper
– Wait a while after the topology is deployed
• Fast
– Better to use fabric
• Stable
– But beware of memory leaks