Over the last couple of years, Apache Storm has become a de-facto standard for developing real-time analytics and complex event processing applications. Storm lets companies tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data: it provides "Fast Data" alongside Big Data. Typical use cases include fraud detection, operational intelligence, machine learning, ETL, and analytics.
In this meetup, Eugene Dvorkin, Architect @WebMD and NYC Storm User Group organizer will teach Apache Storm and Stream Processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well.
The following topics will be covered:
• Why use Apache Storm?
• Common use cases
• Storm Architecture - components, concepts, topology
• Building simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources
8. Every second, on average, around 6,000 tweets are tweeted on Twitter, which corresponds to over 350,000 tweets per minute and 500 million tweets per day. Even a 1% sample of that stream is 3,500 tweets per minute.
9. • How to scale
• How to deal with failures
• What to do with failed messages
• A lot of infrastructure concerns
• Complexity
• Tedious coding
[Diagram: queues and workers feeding a DB]
*Image credit: Nathan Marz, SlideShare: Storm
11. • Exponential rise in real-time data
• New business opportunities
• Economics of OSS and commodity hardware
Stream processing has emerged as a key use case*
*Source: Discover HDP 2.1: Apache Storm for Stream Data Processing. Hortonworks, 2014.
12. • Detecting fraud while someone is swiping a credit card
• Placing an ad on a website while someone is reading a specific article
• Alerting on application and machine failures
• Using stream processing in a batch-oriented fashion
29. • Each spout and bolt may have many instances that perform all the processing in parallel
30. How tuples are sent between instances of spouts and bolts (stream groupings):
• Shuffle grouping – random distribution of tuples across bolt tasks.
• Fields grouping – routes tuples to a bolt task based on the value of the field; the same value always routes to the same task.
• All grouping – replicates the tuple stream across all the bolt tasks; each task receives a copy of the tuple.
• Global grouping – routes all tuples in the stream to a single task; should be used with caution.
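The fields-grouping behavior above can be sketched as hash partitioning on the grouping field. This is a hypothetical illustration, not Storm's actual routing code; the class and method names are invented:

```java
// Hypothetical sketch, not Storm's internal implementation: a fields
// grouping behaves like hash partitioning on the grouping field, so
// tuples with the same field value always reach the same bolt task.
public class FieldsGroupingSketch {

    // Choose a task index from the hash of the grouping field's value.
    // floorMod keeps the result in [0, numTasks) even for negative hashes.
    static int chooseTask(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        // The word "storm" routes to the same task on every call.
        System.out.println(chooseTask("storm", 4) == chooseTask("storm", 4));
    }
}
```

This also shows why fields grouping is what makes a distributed word count correct: all copies of the same word land on the same counting task.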
33. Word-count example: the sentence "Two households, both alike in dignity" flows from a sentence spout through a split bolt, which emits the words two, households, both, alike, in, dignity; count bolts then record each word once. After many sentences the running totals read: Two 20, Households 24, Both 22, Alike 1, In 1, Dignity 10.
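The split-and-count pipeline from this slide can be sketched without Storm at all. This is an illustrative, Storm-free reduction; in a real topology the split and count steps would be separate bolts connected by a fields grouping on the word:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Storm-free sketch of the word-count topology: a "spout" supplies
// sentences, a split "bolt" breaks them into lowercase words, and a
// count "bolt" keeps running totals. All names here are illustrative.
public class WordCountSketch {

    // Split + count in one pass over the sentence stream.
    static Map<String, Integer> count(List<String> sentences) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("Two households, both alike in dignity")));
    }
}
```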
41. Storm main components:
• Nodes – machines in a Storm cluster.
• Workers – JVM processes running on a node; one or more per node.
• Executors – Java threads running within a worker JVM process.
• Tasks – instances of spouts and bolts.
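The relationship between executors and workers can be sketched as round-robin placement. This is a simplified assumption about Storm's default even scheduler, not its actual source (real scheduling also involves supervisors and slots):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch (an assumption, not Storm's scheduler code) of how
// the default even scheduler spreads executor threads round-robin over
// the available worker processes.
public class EvenSchedulerSketch {

    // Returns, per worker, the list of executor ids assigned to it.
    static List<List<Integer>> assign(int numExecutors, int numWorkers) {
        List<List<Integer>> workers = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) {
            workers.add(new ArrayList<>());
        }
        for (int e = 0; e < numExecutors; e++) {
            workers.get(e % numWorkers).add(e); // round-robin placement
        }
        return workers;
    }

    public static void main(String[] args) {
        // 10 executors over 3 workers -> group sizes 4, 3, 3.
        System.out.println(assign(10, 3));
    }
}
```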
66. • Operations: functions, filters, aggregations, joins, grouping
• Streams are ordered batches of tuples; batches can be partitioned
• Similar to Pig or Cascading
• Transactional spouts
• Trident has first-class abstractions for reading from and writing to stateful sources
67. Stream processed in small batches
• Each batch has a unique ID, which is always the same on each replay
• If one tuple fails, the whole batch is reprocessed
• Higher throughput than core Storm, but higher latency as well
69. Store the count along with the batch ID:
COUNT 100, BATCHID 1 → 10 more tuples arrive with batch ID 2 → COUNT 110, BATCHID 2
Failure: batch 2 is replayed with the same batch ID (2); since the stored batch ID is already 2, the update is not applied a second time.
• The spout should replay a batch exactly as it was played before
• The Trident API hides the complexity of dealing with batch IDs
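The idempotent update described on this slide can be sketched in a few lines. The names here are illustrative, not Trident's real State API; it only shows why storing the batch ID next to the count makes replays safe:

```java
// Sketch of the slide's transactional-state idea: persist the count
// together with the id of the last applied batch, so a replayed batch
// is applied at most once. Illustrative names, not Trident's API.
public class BatchCountState {
    long count = 0;
    long lastBatchId = 0; // last batch already folded into count

    // Apply a batch only if it has not been applied before.
    void applyBatch(long batchId, int tuplesInBatch) {
        if (batchId > lastBatchId) {
            count += tuplesInBatch;
            lastBatchId = batchId;
        }
        // else: replay of an already-applied batch -> idempotent no-op
    }

    public static void main(String[] args) {
        BatchCountState state = new BatchCountState();
        state.applyBatch(1, 100); // count = 100, batch 1 recorded
        state.applyBatch(2, 10);  // count = 110, batch 2 recorded
        state.applyBatch(2, 10);  // replayed batch 2: count stays 110
        System.out.println(state.count);
    }
}
```

In a real store the count and batch ID would be written atomically in one record, which is exactly why the slide keeps them together.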
Recently we at WebMD had to create an application that processes data from Twitter.
Social media sentiment, machine sensors, Internet of Things, interconnected devices, logs, clickstream data.
CEP and stream-processing solutions existed before, but they were very costly.
Pause
Ready for the enterprise – not only for Twitter or LinkedIn.
Pause
Meaning – fault tolerant
Workers, spouts – slow down on the basics.
A bolt processes any number of input streams and produces any number of new output streams.
Most of the logic of a computation goes into bolts, such as functions, filters, streaming joins, streaming aggregations, talking to databases, and so on.
pause
A topology is a network of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt.
DAG
pause
Like a driver in Hadoop.
pause
pause
Storm considers a tuple coming off a spout fully processed when every message in the tree has been processed. A tuple is considered failed when its tree of messages fails to be fully processed within a configurable timeout. The default is 30 seconds.
When emitting a tuple, the spout provides a "message id" that will be used to identify the tuple later.
Link between incoming and derived tuple.
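The tuple tree described above can be modeled with a simple pending counter per message id. This toy model is an assumption for illustration only; Storm's real acker daemon uses an XOR trick plus a timeout rather than a counter:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the tuple tree (not Storm's real XOR-based acker): each
// spout tuple gets a message id; every derived tuple anchored to it
// increments a pending count, every ack decrements it. The spout tuple
// is fully processed when the count returns to zero.
public class TupleTreeSketch {
    private final Map<String, Integer> pending = new HashMap<>();

    // Spout emits the root tuple with a message id.
    void emitFromSpout(String msgId) {
        pending.put(msgId, 1);
    }

    // A bolt emits a derived tuple anchored to the same message id.
    void emitAnchored(String msgId) {
        pending.merge(msgId, 1, Integer::sum);
    }

    // Ack one tuple; returns true when the whole tree is fully processed.
    boolean ack(String msgId) {
        return pending.merge(msgId, -1, Integer::sum) == 0;
    }
}
```

Anchoring is exactly this link: a derived tuple joins the tree of its input, so the root is only acked back to the spout once every descendant has been acked.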
Master and worker node
Nimbus – similar to the job tracker in Hadoop.
Nimbus- responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures
Each worker node runs a daemon called the "Supervisor". The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it.
All coordination between Nimbus and the Supervisors is done through a Zookeeper cluster.
Capacity – percentage of time bolt was busy executing particular task
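That capacity number can be sketched as a simple ratio. The formula below is the commonly documented one for the Storm UI (executed count times execute latency over the window), stated here as an assumption rather than a quote of Storm's source:

```java
// Sketch of the bolt "capacity" metric shown in the Storm UI, assuming
// the commonly documented formula: the fraction of the measurement
// window the bolt spent executing tuples. Values near 1.0 mean the bolt
// is saturated and likely needs more parallelism.
public class CapacitySketch {

    static double capacity(long executed, double executeLatencyMs, long windowMs) {
        return executed * executeLatencyMs / windowMs;
    }

    public static void main(String[] args) {
        // 600,000 tuples at 0.5 ms each over a 10-minute (600,000 ms) window
        System.out.println(capacity(600_000, 0.5, 600_000)); // 0.5
    }
}
```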
Processing will continue, but topology lifecycle operations and the reassignment facility are lost.
Run under system supervision
Trident topologies get compiled into regular Storm topologies of spouts and bolts.
Higher throughput than core Storm, but higher latency as well.
Spout should replay a batch exactly as it was played before
Kafka spout
The Trident API hides the complexity of dealing with batch IDs.
Java fluent API.
Write functions or filters instead of bolts
Fire and forget
A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients
Same code, just different topologies and original sources
Lambda architecture