Over the last couple of years, Apache Storm has become a de-facto standard for developing real-time analytics and complex event processing applications. Storm lets companies tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data: it provides "Fast Data" alongside Big Data. Typical use cases include fraud detection, operational intelligence, machine learning, ETL, and analytics.
In this meetup, Eugene Dvorkin, Architect @WebMD and NYC Storm User Group organizer will teach Apache Storm and Stream Processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well.
The following topics will be covered:
• Why use Apache Storm?
• Common use cases
• Storm Architecture - components, concepts, topology
• Building simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources
8. Every second, on average, around 6,000 tweets are tweeted on Twitter, which corresponds to over 350,000 tweets per minute and 500 million tweets per day. Even a 1% sample of that stream is 3,500 tweets per minute.
9. • How to scale
• How to deal with failures
• What to do with failed messages
• A lot of infrastructure concerns
• Complexity
• Tedious coding
[Diagram: queues and workers feeding a DB]
*Image credit: Nathan Marz, SlideShare: Storm
11. • Exponential rise in real-time data
• New business opportunities
• Economics of OSS and commodity hardware
Stream processing has emerged as a key use case*
*Source: Discover HDP 2.1: Apache Storm for Stream Data Processing. Hortonworks, 2014.
12. • Detecting fraud while someone is swiping a credit card
• Placing an ad on a website while someone is reading a specific article
• Alerting on application and machine failures
• Using stream processing in a batch-oriented fashion
29. • Each spout and bolt may have many instances that perform all the processing in parallel
30. How tuples are sent between instances of spouts and bolts (stream groupings):
• Shuffle grouping – random distribution of tuples across bolt tasks.
• Fields grouping – routes tuples to a bolt task based on the value of the field; the same value always routes to the same task.
• All grouping – replicates the tuple stream across all the bolt tasks; each task receives a copy of the tuple.
• Global grouping – routes all tuples in the stream to a single task; should be used with caution.
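The fields-grouping behavior above can be sketched as hash partitioning on the grouping field. This is a hypothetical illustration, not Storm's actual routing code; the class and method names are invented:

```java
// Hypothetical sketch, not Storm's internal implementation: a fields
// grouping behaves like hash partitioning on the grouping field, so
// tuples with the same field value always reach the same bolt task.
public class FieldsGroupingSketch {

    // Choose a task index from the hash of the grouping field's value.
    // floorMod keeps the result in [0, numTasks) even for negative hashes.
    static int chooseTask(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        // The word "storm" routes to the same task on every call.
        System.out.println(chooseTask("storm", 4) == chooseTask("storm", 4));
    }
}
```

This also shows why fields grouping is what makes a distributed word count correct: all copies of the same word land on the same counting task.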
33. Word-count example: the sentence "Two households, both alike in dignity" flows from a sentence spout through a split bolt, which emits the words two, households, both, alike, in, dignity; count bolts then record each word once. After many sentences the running totals read: Two 20, Households 24, Both 22, Alike 1, In 1, Dignity 10.
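The split-and-count pipeline from this slide can be sketched without Storm at all. This is an illustrative, Storm-free reduction; in a real topology the split and count steps would be separate bolts connected by a fields grouping on the word:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Storm-free sketch of the word-count topology: a "spout" supplies
// sentences, a split "bolt" breaks them into lowercase words, and a
// count "bolt" keeps running totals. All names here are illustrative.
public class WordCountSketch {

    // Split + count in one pass over the sentence stream.
    static Map<String, Integer> count(List<String> sentences) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sentence : sentences) {
            for (String word : sentence.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("Two households, both alike in dignity")));
    }
}
```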
41. Storm main components:
• Nodes – machines in a Storm cluster.
• Workers – JVM processes running on a node; one or more per node.
• Executors – Java threads running within a worker JVM process.
• Tasks – instances of spouts and bolts.
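The relationship between executors and workers can be sketched as round-robin placement. This is a simplified assumption about Storm's default even scheduler, not its actual source (real scheduling also involves supervisors and slots):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch (an assumption, not Storm's scheduler code) of how
// the default even scheduler spreads executor threads round-robin over
// the available worker processes.
public class EvenSchedulerSketch {

    // Returns, per worker, the list of executor ids assigned to it.
    static List<List<Integer>> assign(int numExecutors, int numWorkers) {
        List<List<Integer>> workers = new ArrayList<>();
        for (int w = 0; w < numWorkers; w++) {
            workers.add(new ArrayList<>());
        }
        for (int e = 0; e < numExecutors; e++) {
            workers.get(e % numWorkers).add(e); // round-robin placement
        }
        return workers;
    }

    public static void main(String[] args) {
        // 10 executors over 3 workers -> group sizes 4, 3, 3.
        System.out.println(assign(10, 3));
    }
}
```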
66. • Operations: functions, filters, aggregations, joins, grouping
• Streams are ordered batches of tuples; batches can be partitioned
• Similar to Pig or Cascading
• Transactional spouts
• Trident has first-class abstractions for reading from and writing to stateful sources
67. Stream processed in small batches
• Each batch has a unique ID, which is always the same on each replay
• If one tuple fails, the whole batch is reprocessed
• Higher throughput than core Storm, but higher latency as well
69. Store the count along with the batch ID:
COUNT 100, BATCHID 1 → 10 more tuples arrive with batch ID 2 → COUNT 110, BATCHID 2
Failure: batch 2 is replayed with the same batch ID (2); since the stored batch ID is already 2, the update is not applied a second time.
• The spout should replay a batch exactly as it was played before
• The Trident API hides the complexity of dealing with batch IDs
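The idempotent update described on this slide can be sketched in a few lines. The names here are illustrative, not Trident's real State API; it only shows why storing the batch ID next to the count makes replays safe:

```java
// Sketch of the slide's transactional-state idea: persist the count
// together with the id of the last applied batch, so a replayed batch
// is applied at most once. Illustrative names, not Trident's API.
public class BatchCountState {
    long count = 0;
    long lastBatchId = 0; // last batch already folded into count

    // Apply a batch only if it has not been applied before.
    void applyBatch(long batchId, int tuplesInBatch) {
        if (batchId > lastBatchId) {
            count += tuplesInBatch;
            lastBatchId = batchId;
        }
        // else: replay of an already-applied batch -> idempotent no-op
    }

    public static void main(String[] args) {
        BatchCountState state = new BatchCountState();
        state.applyBatch(1, 100); // count = 100, batch 1 recorded
        state.applyBatch(2, 10);  // count = 110, batch 2 recorded
        state.applyBatch(2, 10);  // replayed batch 2: count stays 110
        System.out.println(state.count);
    }
}
```

In a real store the count and batch ID would be written atomically in one record, which is exactly why the slide keeps them together.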
Recently we at WebMD had to create an application that processes data from Twitter.
Social media sentiment, machine sensors, Internet of Things, interconnected devices, logs, clickstream data.
CEP and stream-processing solutions existed before, but they were very costly.
Pause
Ready for the enterprise – not only for Twitter or LinkedIn.
Pause
Meaning – fault tolerant
Workers, spouts – slow down on the basics.
A bolt processes any number of input streams and produces any number of new output streams.
Most of the logic of a computation goes into bolts, such as functions, filters, streaming joins, streaming aggregations, talking to databases, and so on.
pause
A topology is a network of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt.
DAG
pause
Like a driver in Hadoop.
pause
pause
Storm considers a tuple coming off a spout fully processed when every message in the tree has been processed. A tuple is considered failed when its tree of messages fails to be fully processed within a configurable timeout. The default is 30 seconds.
When emitting a tuple, the spout provides a "message id" that will be used to identify the tuple later.
Link between incoming and derived tuple.
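The tuple tree described above can be modeled with a simple pending counter per message id. This toy model is an assumption for illustration only; Storm's real acker daemon uses an XOR trick plus a timeout rather than a counter:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the tuple tree (not Storm's real XOR-based acker): each
// spout tuple gets a message id; every derived tuple anchored to it
// increments a pending count, every ack decrements it. The spout tuple
// is fully processed when the count returns to zero.
public class TupleTreeSketch {
    private final Map<String, Integer> pending = new HashMap<>();

    // Spout emits the root tuple with a message id.
    void emitFromSpout(String msgId) {
        pending.put(msgId, 1);
    }

    // A bolt emits a derived tuple anchored to the same message id.
    void emitAnchored(String msgId) {
        pending.merge(msgId, 1, Integer::sum);
    }

    // Ack one tuple; returns true when the whole tree is fully processed.
    boolean ack(String msgId) {
        return pending.merge(msgId, -1, Integer::sum) == 0;
    }
}
```

Anchoring is exactly this link: a derived tuple joins the tree of its input, so the root is only acked back to the spout once every descendant has been acked.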
Master and worker node
Nimbus – similar to the job tracker in Hadoop.
Nimbus- responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures
Each worker node runs a daemon called the "Supervisor". The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it.
All coordination between Nimbus and the Supervisors is done through a Zookeeper cluster.
Capacity – percentage of time bolt was busy executing particular task
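That capacity number can be sketched as a simple ratio. The formula below is the commonly documented one for the Storm UI (executed count times execute latency over the window), stated here as an assumption rather than a quote of Storm's source:

```java
// Sketch of the bolt "capacity" metric shown in the Storm UI, assuming
// the commonly documented formula: the fraction of the measurement
// window the bolt spent executing tuples. Values near 1.0 mean the bolt
// is saturated and likely needs more parallelism.
public class CapacitySketch {

    static double capacity(long executed, double executeLatencyMs, long windowMs) {
        return executed * executeLatencyMs / windowMs;
    }

    public static void main(String[] args) {
        // 600,000 tuples at 0.5 ms each over a 10-minute (600,000 ms) window
        System.out.println(capacity(600_000, 0.5, 600_000)); // 0.5
    }
}
```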
Processing will continue, but topology lifecycle operations and the reassignment facility are lost.
Run under system supervision
Trident topologies get compiled into regular Storm topologies of spouts and bolts.
Higher throughput than core Storm, but higher latency as well.
Spout should replay a batch exactly as it was played before
Kafka spout
The Trident API hides the complexity of dealing with batch IDs.
Java fluent API.
Write functions or filters instead of bolts
Fire and forget
A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients
Same code, just different topologies and original sources
Lambda architecture