Distributed Realtime Computation using Apache Storm
By Saurabh Minni
Who am I
Saurabh Minni also on the web as @the100rabh
Yet another developer in Bangalore
I just love tinkering with different technologies
Currently working as Technical Architect at Near
Been part of planning for Barcamp Bangalore since 2007
Author of Apache Kafka Cookbook - https://www.packtpub.com/big-data-and-business-intelligence/apache-kafka-cookbook
What is Apache Storm
From the Apache Storm website:
“Apache Storm is a free and open source distributed
realtime computation system. Storm makes it easy
to reliably process unbounded streams of data,
doing for realtime processing what Hadoop did for
batch processing. Storm is simple, can be used with
any programming language, and is a lot of fun to
use!”
Important terms in Apache Storm
● Topology
● Stream
● Spout
● Bolt
[Diagram: example topology made up of Spout A, Spout B, Bolt A1, Bolt A2, Bolt A3 and Bolt B1]
Important terms in Apache Storm
● Stream groupings
○ A stream grouping defines how a stream should be partitioned among the bolt's tasks
○ Types
■ Shuffle
■ Fields
■ Partial Key
■ Direct
■ Local or Shuffle
■ All
■ Global
■ None
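To make the difference between the groupings concrete, here is a minimal sketch of how a shuffle-style and a fields-style grouping could pick a target task. GroupingSketch and its methods are hypothetical helpers written for this illustration, not Storm classes, and the hash used here is an assumption rather than Storm's actual hashing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: hypothetical helpers, not part of the Storm API.
final class GroupingSketch {
    private GroupingSketch() {}

    // Shuffle-style choice: any task, picked at random to spread load evenly.
    static int shuffleTask(int numTasks) {
        return ThreadLocalRandom.current().nextInt(numTasks);
    }

    // Fields-style choice: hash the grouping values so that equal values
    // always map to the same task index.
    static int fieldsTask(List<Object> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        List<Object> key = Arrays.asList((Object) "user-42");
        System.out.println("fields  -> task " + fieldsTask(key, numTasks));
        System.out.println("fields  -> task " + fieldsTask(key, numTasks)); // same task as above
        System.out.println("shuffle -> task " + shuffleTask(numTasks));     // may differ per call
    }
}
```

The property that matters for the design patterns later in the deck is the second one: with a fields grouping, tuples that share the same field values always land on the same task.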
Important terms in Apache Storm
● Reliability
○ Storm guarantees that every spout tuple will be
fully processed by the topology
○ It does this by tracking the tree of tuples triggered
by every spout tuple and determining when that tree
of tuples has been successfully completed
○ Every topology has a message timeout. If Storm fails
to detect that a spout tuple has been completed within
that timeout, it fails the tuple and replays it later.
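One way to picture the tuple-tree tracking is the XOR bookkeeping that Storm's acker tasks are documented to use: every tuple id is XORed into a per-spout-tuple value once when the tuple is anchored and once when it is acked, so the value returns to zero when the whole tree is done. The sketch below is a simplified stand-in under that assumption. AckTrackerSketch is a hypothetical class, it reports ids one at a time, and it does not model the timeout or the replay.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a simplified stand-in for the acker's bookkeeping.
final class AckTrackerSketch {
    // spout tuple id -> XOR of every tuple id anchored or acked in its tree
    private final Map<Long, Long> ackVals = new HashMap<>();

    // Call once when a tuple in the tree is anchored and once when it is acked.
    // Each id is therefore XORed in twice and cancels itself out.
    // Returns true when the whole tree has been acked.
    boolean update(long spoutTupleId, long tupleId) {
        long val = ackVals.getOrDefault(spoutTupleId, 0L) ^ tupleId;
        if (val == 0L) {
            ackVals.remove(spoutTupleId);
            return true;
        }
        ackVals.put(spoutTupleId, val);
        return false;
    }

    boolean isPending(long spoutTupleId) {
        return ackVals.containsKey(spoutTupleId);
    }

    public static void main(String[] args) {
        AckTrackerSketch tracker = new AckTrackerSketch();
        long root = 100L;
        tracker.update(root, 1L);                     // spout emits tuple 1
        tracker.update(root, 2L);                     // bolt anchors tuple 2 to tuple 1
        tracker.update(root, 1L);                     // bolt acks tuple 1
        System.out.println(tracker.update(root, 2L)); // ack of tuple 2 completes the tree
    }
}
```

The appeal of this scheme is that the tracker needs only a fixed amount of state per spout tuple, no matter how large the tuple tree grows. It relies on tuple ids being random 64-bit values so that an accidental zero is extremely unlikely.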
Important terms in Apache Storm
● Tasks
○ Each task corresponds to one thread of execution
○ Stream groupings define how to send tuples
from one set of tasks to another set of tasks
○ You can control the parallelism of each spout
and bolt
Important terms in Apache Storm
● Workers
○ Each worker process is a physical JVM
○ Each worker executes a subset of all the tasks for the
topology
○ Storm tries to spread the tasks evenly across all
the workers.
Important terms in Apache Storm
● Tuple
○ Storm uses tuples as its data model
○ A tuple is a named list of values
○ A field in a tuple can be an object of any type. Out of the
box, Storm supports the primitive types, strings, and byte arrays
○ You can implement a serializer for custom
class objects
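As a rough illustration of "a named list of values", the sketch below pairs a list of field names with a list of values and allows lookup by name or by position. NamedTupleSketch is a hypothetical stand-in written for this example. It is not Storm's Tuple interface, although the accessor names are chosen to resemble it.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: a hypothetical stand-in for a Storm tuple.
final class NamedTupleSketch {
    private final List<String> fields;
    private final List<Object> values;

    NamedTupleSketch(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("one value is required per field");
        }
        this.fields = fields;
        this.values = values;
    }

    // Look a value up by its field name.
    Object getValueByField(String field) {
        int i = fields.indexOf(field);
        if (i < 0) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return values.get(i);
    }

    // Look a value up by its position in the list.
    Object getValue(int index) {
        return values.get(index);
    }

    public static void main(String[] args) {
        NamedTupleSketch t = new NamedTupleSketch(
                Arrays.asList("word", "count"),
                Arrays.<Object>asList("storm", 3));
        System.out.println(t.getValueByField("word") + " = " + t.getValue(1));
    }
}
```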
Understanding worker processes, executors and tasks
More terms for you
Zookeeper
Nimbus
Supervisor
Some important configuration options
● nimbus.seeds
○ The worker nodes need to know which machines are
candidates for master (Nimbus) in order to download
topology jars and confs.
● supervisor.slots.ports
○ For each worker machine, you configure how many
workers run on that machine with this config
○ defines which ports are open for use
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
nimbus.seeds: ["111.222.333.44"]
Show me the Code
Design patterns for distributed realtime computation
1. Streaming joins
2. Batching
3. BasicBolt
4. In-memory caching + fields grouping combo
5. Streaming top N
6. TimeCacheMap for efficiently keeping a cache of things that have been
recently updated
7. CoordinatedBolt and KeyedFairBolt for Distributed RPC
Joins
A streaming join combines two or more data streams together based on some
common field.
The join type you need will vary per application. Some applications join all
tuples for two streams over a finite window of time, whereas other
applications expect exactly one tuple for each side of the join for each join
field. Other applications may do the join completely differently. The common
pattern among all these join types is partitioning multiple input streams in the
same way. This is easily accomplished in Storm by using a fields grouping on
the same fields for many input streams to the joiner bolt.
builder.setBolt("join", new MyJoiner(), parallelism)
.fieldsGrouping("1", new Fields("joinfield1", "joinfield2"))
.fieldsGrouping("2", new Fields("joinfield1", "joinfield2"))
.fieldsGrouping("3", new Fields("joinfield1", "joinfield2"));
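What the joiner bolt does internally depends on the join type. As a minimal sketch of the "exactly one tuple for each side" case, the hypothetical class below buffers whichever side arrives first and emits the pair when the other side shows up. PairJoinSketch is an assumption made for illustration and is not the MyJoiner bolt referenced above. It works only because the fields grouping sends both sides of a key to the same task.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative only: a hypothetical two-sided joiner keyed on the join field.
final class PairJoinSketch {
    private final Map<String, String> pendingLeft = new HashMap<>();
    private final Map<String, String> pendingRight = new HashMap<>();

    // Offer a value from the left stream. Returns the joined pair if the
    // right side for this key has already arrived, otherwise buffers it.
    Optional<String> offerLeft(String key, String value) {
        String right = pendingRight.remove(key);
        if (right == null) {
            pendingLeft.put(key, value);
            return Optional.empty();
        }
        return Optional.of(value + "|" + right);
    }

    // Same as offerLeft, for the right stream.
    Optional<String> offerRight(String key, String value) {
        String left = pendingLeft.remove(key);
        if (left == null) {
            pendingRight.put(key, value);
            return Optional.empty();
        }
        return Optional.of(left + "|" + value);
    }
}
```

A windowed join would replace the two maps with time-bounded buffers, but the partitioning requirement stays the same.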
Batching
Oftentimes for efficiency reasons or otherwise, you want to process a group of
tuples in batch rather than individually. For example, you may want to batch
updates to a database or do a streaming aggregation of some sort.
If you want reliability in your data processing, the right way to do this is to
hold on to tuples in an instance variable while the bolt waits to do the
batching. Once you do the batch operation, you then ack all the tuples you
were holding onto.
If the bolt emits tuples, then you may want to use multi-anchoring to ensure
reliability.
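The hold-then-ack idea can be sketched without any Storm types. In the hypothetical helper below, T stands in for a tuple, batchOperation stands in for something like a bulk database write, and ack stands in for the collector's ack call. All three names are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative only: a hypothetical batching helper; T stands in for a Storm tuple.
final class BatchBufferSketch<T> {
    private final int batchSize;
    private final Consumer<List<T>> batchOperation; // for example, one bulk database write
    private final Consumer<T> ack;                  // for example, the collector's ack in a real bolt
    private final List<T> held = new ArrayList<>();

    BatchBufferSketch(int batchSize, Consumer<List<T>> batchOperation, Consumer<T> ack) {
        this.batchSize = batchSize;
        this.batchOperation = batchOperation;
        this.ack = ack;
    }

    // Hold the tuple; once the batch is full, flush it.
    void add(T tuple) {
        held.add(tuple);
        if (held.size() >= batchSize) {
            flush();
        }
    }

    // Run the batch operation first and ack only afterwards, so a failed
    // operation leaves every held tuple un-acked.
    void flush() {
        if (held.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(held);
        batchOperation.accept(batch);
        batch.forEach(ack);
        held.clear();
    }

    int heldCount() {
        return held.size();
    }
}
```

Because held tuples are not acked until the flush, a batch must be flushed well within the topology's message timeout. Otherwise the held tuples would be failed and replayed.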
BasicBolt
Many bolts follow a similar pattern of reading an input tuple, emitting zero or
more tuples based on that input tuple, and then acking that input tuple
immediately at the end of the execute method. Bolts that match this pattern
are things like functions and filters. This is such a common pattern that Storm
exposes an interface called IBasicBolt that automates this pattern for you. See
Guaranteeing message processing for more information.
In-memory caching + fields grouping combo
It's common to keep caches in-memory in Storm bolts. Caching becomes particularly powerful when
you combine it with a fields grouping. For example, suppose you have a bolt that expands short
URLs (like bit.ly, t.co, etc.) into long URLs. You can increase performance by keeping an LRU cache of
short URL to long URL expansions to avoid doing the same HTTP requests over and over. Suppose
component "urls" emits short URLs, and component "expand" expands short URLs into long URLs
and keeps a cache internally. Consider the difference between the two following snippets of code:
builder.setBolt("expand", new ExpandUrl(), parallelism)
.shuffleGrouping("urls");
builder.setBolt("expand", new ExpandUrl(), parallelism)
.fieldsGrouping("urls", new Fields("url"));
The second approach will have vastly more effective caches, since the same URL will always go to the
same task. This avoids having duplication across any of the caches in the tasks and makes it much
more likely that a short URL will hit the cache.
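For the cache itself, one common way to get LRU behaviour in plain Java is a LinkedHashMap in access order with removeEldestEntry overridden. The sketch below assumes that approach. LruCacheSketch is a hypothetical name, and the slides do not say how ExpandUrl implements its cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a small LRU cache that a bolt task could keep in memory.
final class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    LruCacheSketch(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once the cache is over capacity.
        return size() > maxEntries;
    }
}
```

Each task holds its own instance of such a cache, which is why the fields grouping matters: it keeps every short URL on a single task instead of spreading duplicates across all of them.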
Streaming top N
A common continuous computation done on Storm is a "streaming top N" of some sort. Suppose you have a bolt that emits
tuples of the form ["value", "count"] and you want a bolt that emits the top N tuples based on count. The simplest way to do this
is to have a bolt that does a global grouping on the stream and maintains a list in memory of the top N items.
This approach obviously doesn't scale to large streams since the entire stream has to go through one task. A better way to do
the computation is to do many top N's in parallel across partitions of the stream, and then merge those top N's together to get
the global top N. The pattern looks like this:
builder.setBolt("rank", new RankObjects(), parallelism)
.fieldsGrouping("objects", new Fields("value"));
builder.setBolt("merge", new MergeObjects())
.globalGrouping("rank");
This pattern works because of the fields grouping done by the first bolt which gives the partitioning you need for this to be
semantically correct. You can see an example of this pattern in the storm-starter project.
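The rank-then-merge step can be sketched with plain collections. TopNSketch below is a hypothetical helper, not the RankObjects or MergeObjects bolts named above. It assumes each partition already holds the complete count for every value it owns, which is what the fields grouping on "value" provides.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: per-partition ranking followed by a global merge.
final class TopNSketch {
    private TopNSketch() {}

    // Highest count first; ties are broken by value name for a stable order.
    private static final Comparator<Map.Entry<String, Long>> BY_COUNT_DESC = (a, b) -> {
        int c = Long.compare(b.getValue(), a.getValue());
        return c != 0 ? c : a.getKey().compareTo(b.getKey());
    };

    // Top n entries of one partition's counts.
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        return counts.entrySet().stream()
                .sorted(BY_COUNT_DESC)
                .limit(n)
                .collect(Collectors.toList());
    }

    // Merge the per-partition top-n lists into the global top n.
    static List<Map.Entry<String, Long>> merge(List<List<Map.Entry<String, Long>>> partials, int n) {
        List<Map.Entry<String, Long>> all = new ArrayList<>();
        partials.forEach(all::addAll);
        return all.stream()
                .sorted(BY_COUNT_DESC)
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

The merge step only ever sees n entries per partition, so the single task behind the global grouping handles a small, bounded input instead of the entire stream.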
Streaming top N (Contd)
If however you have a known skew in the data being processed it can be advantageous to use partialKeyGrouping instead of
fieldsGrouping. This will distribute the load for each key between two downstream bolts instead of a single one.
builder.setBolt("count", new CountObjects(), parallelism)
.partialKeyGrouping("objects", new Fields("value"));
builder.setBolt("rank", new AggregateCountsAndRank(), parallelism)
.fieldsGrouping("count", new Fields("key"));
builder.setBolt("merge", new MergeRanksObjects())
.globalGrouping("rank");
The topology needs an extra layer of processing to aggregate the partial counts from the upstream bolts, but this layer only processes
aggregated values, so the bolt is not subject to the load caused by the skewed data. You can see an example of this
pattern in the storm-starter project.
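The "two downstream bolts per key" behaviour can be sketched as follows: derive two candidate tasks from the key and send each tuple to whichever candidate has received fewer tuples so far. PartialKeySketch is a hypothetical class, and both hash formulas are assumptions chosen for illustration rather than the ones Storm uses.

```java
// Illustrative only: choose between two candidate tasks per key, preferring
// the one that has received fewer tuples so far.
final class PartialKeySketch {
    private final long[] sentPerTask;

    PartialKeySketch(int numTasks) {
        this.sentPerTask = new long[numTasks];
    }

    int chooseTask(String key) {
        int h = key.hashCode();
        int first = Math.floorMod(h, sentPerTask.length);
        int second = Math.floorMod(31 * h + 17, sentPerTask.length); // hypothetical second hash
        int chosen = sentPerTask[first] <= sentPerTask[second] ? first : second;
        sentPerTask[chosen]++;
        return chosen;
    }

    long sentTo(int task) {
        return sentPerTask[task];
    }

    public static void main(String[] args) {
        PartialKeySketch grouping = new PartialKeySketch(4);
        for (int i = 0; i < 10; i++) {
            grouping.chooseTask("a"); // one hot key
        }
        for (int task = 0; task < 4; task++) {
            System.out.println("task " + task + " received " + grouping.sentTo(task));
        }
    }
}
```

Since a key's tuples may now be split across two tasks, each task holds only a partial count. That is why the topology above adds the fields-grouped "rank" layer to combine them.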
TimeCacheMap
You sometimes want to keep a cache in memory of items that have been
recently "active" and have items that have been inactive for some time
automatically expire. TimeCacheMap is an efficient data structure for doing
this and provides hooks so you can insert callbacks whenever an item is
expired.
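One way such a structure can be built is with a small number of rotating buckets: new and updated items go into the newest bucket, and on every time slice the oldest bucket is dropped and its contents reported to a callback. The sketch below follows that idea. ExpiringCacheSketch is a hypothetical class, not Storm's TimeCacheMap, and the caller is assumed to drive rotate() from a timer.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Illustrative only: a hypothetical bucket-rotating cache with an expiry callback.
final class ExpiringCacheSketch<K, V> {
    private final Deque<Map<K, V>> buckets = new ArrayDeque<>();
    private final BiConsumer<K, V> onExpire;

    ExpiringCacheSketch(int numBuckets, BiConsumer<K, V> onExpire) {
        if (numBuckets < 2) {
            throw new IllegalArgumentException("need at least 2 buckets");
        }
        for (int i = 0; i < numBuckets; i++) {
            buckets.addLast(new HashMap<>());
        }
        this.onExpire = onExpire;
    }

    // Writing a key marks it as recently active by moving it to the newest bucket.
    void put(K key, V value) {
        for (Map<K, V> bucket : buckets) {
            bucket.remove(key);
        }
        buckets.peekFirst().put(key, value);
    }

    // Returns the cached value, or null if the key is absent or has expired.
    V get(K key) {
        for (Map<K, V> bucket : buckets) {
            if (bucket.containsKey(key)) {
                return bucket.get(key);
            }
        }
        return null;
    }

    // Call once per time slice: drops the oldest bucket and reports its
    // entries to the expiry callback.
    void rotate() {
        Map<K, V> oldest = buckets.removeLast();
        buckets.addFirst(new HashMap<>());
        oldest.forEach(onExpire);
    }
}
```

Expiry costs one bucket swap per time slice rather than a scan over every entry. The trade-off is that an item lives for somewhere between one and two rotation cycles instead of an exact duration.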
CoordinatedBolt and KeyedFairBolt for Distributed RPC
When building distributed RPC applications on top of Storm, there are two
common patterns that are usually needed. These are encapsulated by
CoordinatedBolt and KeyedFairBolt which are part of the "standard library"
that ships with the Storm codebase.
CoordinatedBolt wraps the bolt containing your logic and figures out when
your bolt has received all the tuples for any given request. It makes heavy use
of direct streams to do this.
KeyedFairBolt also wraps the bolt containing your logic and makes sure your
topology processes multiple DRPC invocations at the same time, instead of
doing them serially one at a time.

Distributed Realtime Computation using Apache Storm

  • 1. Distributed Realtime Computation using By Saurabh Minni
  • 2. Who am I Saurabh Minni also on the web as @the100rabh Yet another developer in Bangalore I just love tinkering with different technologies Currently working as Technical Architect at Near Been part of planning for Barcamp Bangalore since 2007 Author of Apache Kafka Cookbook - https://www. packtpub.com/big-data-and-business-intelligence/apache- kafka-cookbook
  • 3. What is Apache Storm From Apache Storm website : “Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!”
  • 4. Important terms in Apache Storm ● Topology Spout A Spout B Bolt A2 Bolt A1 Bolt B1 Bolt A3
  • 5. Important terms in Apache Storm ● Stream Spout A Spout B Bolt A2 Bolt A1 Bolt B1 Bolt A3
  • 6. Important terms in Apache Storm ● Spout Spout A Spout B Bolt A2 Bolt A1 Bolt B1 Bolt A3
  • 7. Important terms in Apache Storm ● Bolt Spout A Spout B Bolt A2 Bolt A1 Bolt B1 Bolt A3
  • 8. Important terms in Apache Storm ● Stream groupings ○ Define how a stream should be partitioned among the bolt's tasks ○ Types ■ Shuffle ■ Fields ■ Partial Key ■ Direct ■ Local or Shuffle ■ All ■ None
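The difference between a fields grouping and a shuffle grouping can be sketched in plain Java, with no Storm dependency. This is only an illustration of the idea: `GroupingSketch`, `fieldsTask` and `shuffleTask` are hypothetical names, the round-robin cursor merely stands in for Storm's shuffle, and Storm's own implementation differs in detail.

```java
import java.util.List;

// Hypothetical illustration of grouping behaviour; not Storm's actual code.
final class GroupingSketch {
    private int next = 0; // round-robin cursor standing in for a shuffle

    // Fields grouping: equal grouping values always map to the same task index.
    static int fieldsTask(List<Object> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    // Shuffle-like grouping: spreads tuples evenly and ignores their content.
    int shuffleTask(int numTasks) {
        int task = next % numTasks;
        next = (next + 1) % numTasks;
        return task;
    }
}
```

With a fields grouping, two tuples carrying the same value for the grouping field land on the same task, which is what the join, caching and top N patterns later in the deck rely on.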
  • 9. Important terms in Apache Storm ● Reliability ○ Storm guarantees that every spout tuple will be fully processed ○ It does this by tracking the tree of tuples triggered by every spout tuple and determining when that tree of tuples has been successfully completed ○ If Storm fails to detect that a spout tuple has been completed within the message timeout, it fails the tuple and replays it later.
  • 10. Important terms in Apache Storm ● Tasks ○ Each task corresponds to one thread of execution ○ Stream groupings define how to send tuples from one set of tasks to another set of tasks ○ You can control the parallelism in each spout and bolt as well
  • 11. Important terms in Apache Storm ● Workers ○ Each worker process is a physical JVM ○ Each worker executes a subset of all the tasks for the topology ○ Storm tries to spread the tasks evenly across all the workers.
  • 12. Important terms in Apache Storm ● Tuple ○ Storm uses tuples as its data model ○ A tuple is a named list of values ○ A field in a tuple can be an object of any type, such as the primitive types, strings, and byte arrays ○ You can implement a serializer for custom class objects
  • 13. Understanding worker processes, executors and tasks
  • 14. More terms for you Zookeeper Nimbus Supervisor
  • 15. Some important configuration options ● nimbus.seeds ○ The worker nodes need to know which machines are candidates for master in order to download topology jars and confs. ● supervisor.slots.ports ○ For each worker machine, this setting configures how many workers run on that machine ○ It defines which ports are open for use
        supervisor.slots.ports:
            - 6700
            - 6701
            - 6702
            - 6703
        nimbus.seeds: ["111.222.333.44"]
  • 17. Design patterns for distributed realtime computation 1. Streaming joins 2. Batching 3. BasicBolt 4. In-memory caching + fields grouping combo 5. Streaming top N 6. TimeCacheMap for efficiently keeping a cache of things that have been recently updated 7. CoordinatedBolt and KeyedFairBolt for Distributed RPC
  • 18. Joins A streaming join combines two or more data streams together based on some common field. The join type you need will vary per application. Some applications join all tuples for two streams over a finite window of time, whereas other applications expect exactly one tuple for each side of the join for each join field. Other applications may do the join completely differently. The common pattern among all these join types is partitioning multiple input streams in the same way. This is easily accomplished in Storm by using a fields grouping on the same fields for many input streams to the joiner bolt.
        builder.setBolt("join", new MyJoiner(), parallelism)
            .fieldsGrouping("1", new Fields("joinfield1", "joinfield2"))
            .fieldsGrouping("2", new Fields("joinfield1", "joinfield2"))
            .fieldsGrouping("3", new Fields("joinfield1", "joinfield2"));
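What a joiner task might do internally for the "exactly one tuple for each side" case can be sketched in plain Java. This is a minimal illustration under assumptions, not the `MyJoiner` bolt from the slide: `OneToOneJoiner`, `onLeft` and `onRight` are hypothetical names, keys and values are plain strings, and there is no Storm dependency, acking or timeout handling.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory joiner for the "exactly one tuple per side" case.
final class OneToOneJoiner {
    private final Map<String, String> pendingLeft = new HashMap<>();
    private final Map<String, String> pendingRight = new HashMap<>();

    // Returns the joined pair once both sides for the key have arrived.
    Optional<String> onLeft(String key, String value) {
        String right = pendingRight.remove(key);
        if (right == null) {
            pendingLeft.put(key, value); // wait for the other side
            return Optional.empty();
        }
        return Optional.of(value + "|" + right);
    }

    Optional<String> onRight(String key, String value) {
        String left = pendingLeft.remove(key);
        if (left == null) {
            pendingRight.put(key, value); // wait for the other side
            return Optional.empty();
        }
        return Optional.of(left + "|" + value);
    }
}
```

The local maps are only sufficient because the fields grouping routes both sides of a given join key to the same task; with a shuffle grouping the two halves could end up in different tasks and never meet.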
  • 19. Batching Oftentimes for efficiency reasons or otherwise, you want to process a group of tuples in batch rather than individually. For example, you may want to batch updates to a database or do a streaming aggregation of some sort. If you want reliability in your data processing, the right way to do this is to hold on to tuples in an instance variable while the bolt waits to do the batching. Once you do the batch operation, you then ack all the tuples you were holding onto. If the bolt emits tuples, then you may want to use multi-anchoring to ensure reliability.
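The hold-then-ack approach described above can be sketched without Storm as follows. All names here are hypothetical: `BatchingSketch` stands in for a bolt, `T` for a Storm tuple, `flush` for the batch operation (for example one bulk database write), and `ack` for acking a single tuple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical batching helper; T stands in for a Storm tuple.
final class BatchingSketch<T> {
    private final int batchSize;
    private final Consumer<List<T>> flush; // e.g. one bulk database write
    private final Consumer<T> ack;         // stand-in for acking one tuple
    private final List<T> held = new ArrayList<>();

    BatchingSketch(int batchSize, Consumer<List<T>> flush, Consumer<T> ack) {
        this.batchSize = batchSize;
        this.flush = flush;
        this.ack = ack;
    }

    // Hold the tuple unacked; ack the whole batch only after the flush returns.
    void execute(T tuple) {
        held.add(tuple);
        if (held.size() >= batchSize) {
            flush.accept(new ArrayList<>(held));
            held.forEach(ack);
            held.clear();
        }
    }

    int pending() {
        return held.size();
    }
}
```

Because nothing is acked until the flush has returned, a failure during the batch operation leaves every held tuple unacked, so they would eventually time out and be replayed.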
  • 20. BasicBolt Many bolts follow a similar pattern of reading an input tuple, emitting zero or more tuples based on that input tuple, and then acking that input tuple immediately at the end of the execute method. Bolts that match this pattern are things like functions and filters. This is such a common pattern that Storm exposes an interface called IBasicBolt that automates this pattern for you. See Guaranteeing message processing for more information.
  • 21. In-memory caching + fields grouping combo It's common to keep caches in-memory in Storm bolts. Caching becomes particularly powerful when you combine it with a fields grouping. For example, suppose you have a bolt that expands short URLs (like bit.ly, t.co, etc.) into long URLs. You can increase performance by keeping an LRU cache of short URL to long URL expansions to avoid doing the same HTTP requests over and over. Suppose component "urls" emits short URLs, and component "expand" expands short URLs into long URLs and keeps a cache internally. Consider the difference between the two following snippets of code:
        builder.setBolt("expand", new ExpandUrl(), parallelism)
            .shuffleGrouping("urls");

        builder.setBolt("expand", new ExpandUrl(), parallelism)
            .fieldsGrouping("urls", new Fields("url"));
     The second approach will have vastly more effective caches, since the same URL will always go to the same task. This avoids having duplication across any of the caches in the tasks and makes it much more likely that a short URL will hit the cache.
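A per-task LRU cache of the kind described above can be built on `java.util.LinkedHashMap` in access order. The sketch below is an assumption about how such a cache might look, not the `ExpandUrl` bolt itself: `ExpandCache` is a hypothetical name and `resolver` stands in for the HTTP request that expands a short URL.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical per-task LRU cache; `resolver` stands in for the HTTP lookup.
final class ExpandCache {
    private final Map<String, String> lru;
    private final Function<String, String> resolver;
    int misses = 0; // number of times the resolver had to be called

    ExpandCache(int capacity, Function<String, String> resolver) {
        this.resolver = resolver;
        // accessOrder=true keeps entries ordered least recently used first.
        this.lru = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity;
            }
        };
    }

    String expand(String shortUrl) {
        String cached = lru.get(shortUrl);
        if (cached != null) {
            return cached;
        }
        misses++;
        String longUrl = resolver.apply(shortUrl);
        lru.put(shortUrl, longUrl);
        return longUrl;
    }
}
```

Each task would hold its own `ExpandCache`. With a fields grouping on the URL field, a given short URL is only ever looked up in one of those caches, so the miss counter across all tasks stays much lower than with a shuffle grouping.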
  • 22. Streaming top N A common continuous computation done on Storm is a "streaming top N" of some sort. Suppose you have a bolt that emits tuples of the form ["value", "count"] and you want a bolt that emits the top N tuples based on count. The simplest way to do this is to have a bolt that does a global grouping on the stream and maintains a list in memory of the top N items. This approach obviously doesn't scale to large streams since the entire stream has to go through one task. A better way to do the computation is to do many top N's in parallel across partitions of the stream, and then merge those top N's together to get the global top N. The pattern looks like this:
        builder.setBolt("rank", new RankObjects(), parallelism)
            .fieldsGrouping("objects", new Fields("value"));
        builder.setBolt("merge", new MergeObjects())
            .globalGrouping("rank");
     This pattern works because of the fields grouping done by the first bolt which gives the partitioning you need for this to be semantically correct. You can see an example of this pattern in storm-starter.
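The rank-then-merge idea can be illustrated in plain Java. This is a sketch under assumptions, not the storm-starter code: `TopNSketch`, `topN` and `merge` are hypothetical names, and each partition is represented as a list of (value, count) entries rather than a stream of tuples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the "rank" and "merge" steps.
final class TopNSketch {
    // Top N of one partition's (value, count) pairs, highest count first.
    static List<Map.Entry<String, Long>> topN(List<Map.Entry<String, Long>> counts, int n) {
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts);
        sorted.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    // Merge step: the global top N is the top N of the partial top N lists.
    static List<Map.Entry<String, Long>> merge(List<List<Map.Entry<String, Long>>> partials, int n) {
        List<Map.Entry<String, Long>> all = new ArrayList<>();
        partials.forEach(all::addAll);
        return topN(all, n);
    }
}
```

The merge is only correct if each value appears in exactly one partition with its full count, which is the guarantee the fields grouping on "value" provides.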
  • 23. Streaming top N (Contd) If however you have a known skew in the data being processed it can be advantageous to use partialKeyGrouping instead of fieldsGrouping. This will distribute the load for each key between two downstream bolts instead of a single one.
        builder.setBolt("count", new CountObjects(), parallelism)
            .partialKeyGrouping("objects", new Fields("value"));
        builder.setBolt("rank", new AggregateCountsAndRank(), parallelism)
            .fieldsGrouping("count", new Fields("key"));
        builder.setBolt("merge", new MergeRanksObjects())
            .globalGrouping("rank");
     The topology needs an extra layer of processing to aggregate the partial counts from the upstream bolts, but this layer only processes aggregated values, so the bolt is not subject to the load caused by the skewed data. You can see an example of this pattern in storm-starter.
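One way to picture "distribute the load for each key between two downstream bolts" is two-choice routing: each key gets two candidate tasks and every tuple goes to whichever candidate has received less so far. The sketch below is a simplified illustration under that assumption, not Storm's `partialKeyGrouping` implementation; `PartialKeySketch` and the second hash formula are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of two-choice routing; not Storm's own grouping code.
final class PartialKeySketch {
    private final long[] sent; // tuples routed to each task so far

    PartialKeySketch(int numTasks) {
        this.sent = new long[numTasks];
    }

    // Each key has two candidate tasks (they may coincide for some keys);
    // route the tuple to the candidate that has received fewer tuples.
    int chooseTask(String key) {
        int first = Math.floorMod(key.hashCode(), sent.length);
        int second = Math.floorMod(key.hashCode() * 31 + 17, sent.length);
        int chosen = sent[first] <= sent[second] ? first : second;
        sent[chosen]++;
        return chosen;
    }

    long[] load() {
        return Arrays.copyOf(sent, sent.length);
    }
}
```

Since one key can now be counted in two places, each task only holds a partial count, which is why the topology above adds a layer that aggregates by key before ranking.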
  • 24. TimeCacheMap You sometimes want to keep a cache in memory of items that have been recently "active" and have items that have been inactive for some time automatically expire. TimeCacheMap is an efficient data structure for doing this and provides hooks so you can insert callbacks whenever an item is expired.
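A cache with expiry callbacks can be approximated with a small set of rotating buckets: writes go to the newest bucket, and each rotation drops the oldest bucket and fires the callback for whatever it still holds. The sketch below illustrates that idea only; `ExpiringCacheSketch` is a hypothetical name, `rotate()` is called by hand where a real cache would use a timer, and none of this is the `TimeCacheMap` API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical bucket-rotation cache; rotate() would normally run on a timer.
final class ExpiringCacheSketch<K, V> {
    private final Deque<Map<K, V>> buckets = new ArrayDeque<>();
    private final BiConsumer<K, V> onExpire;

    // numBuckets is assumed to be at least 1.
    ExpiringCacheSketch(int numBuckets, BiConsumer<K, V> onExpire) {
        for (int i = 0; i < numBuckets; i++) {
            buckets.addLast(new HashMap<>());
        }
        this.onExpire = onExpire;
    }

    // A write marks the key as active: move it into the newest bucket.
    void put(K key, V value) {
        for (Map<K, V> bucket : buckets) {
            bucket.remove(key);
        }
        buckets.peekFirst().put(key, value);
    }

    V get(K key) {
        for (Map<K, V> bucket : buckets) {
            if (bucket.containsKey(key)) {
                return bucket.get(key);
            }
        }
        return null;
    }

    // Drop the oldest bucket, fire callbacks for its entries, add a fresh one.
    void rotate() {
        Map<K, V> expired = buckets.removeLast();
        buckets.addFirst(new HashMap<>());
        expired.forEach(onExpire);
    }
}
```

Expiry costs one bucket swap per rotation rather than a scan over every entry, at the price of coarse timing: an item expires somewhere between one and `numBuckets` rotations after its last write.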
  • 25. CoordinatedBolt and KeyedFairBolt for Distributed RPC When building distributed RPC applications on top of Storm, there are two common patterns that are usually needed. These are encapsulated by CoordinatedBolt and KeyedFairBolt which are part of the "standard library" that ships with the Storm codebase. CoordinatedBolt wraps the bolt containing your logic and figures out when your bolt has received all the tuples for any given request. It makes heavy use of direct streams to do this. KeyedFairBolt also wraps the bolt containing your logic and makes sure your topology processes multiple DRPC invocations at the same time, instead of doing them serially one at a time.