Apache Storm and Spark
Streaming Compared
P. Taylor Goetz, Hortonworks
@ptgoetz
Honestly...
• I know a lot more about Apache Storm than I do
about Apache Spark Streaming.
• I've been involved with Apache Storm, in one
way or another, since it was open-sourced.
• I'm admittedly biased.
But...
• A number of articles/papers comparing Apache
Storm and Spark Streaming are inaccurate in
terms of Storm’s features and performance
characteristics.
• Code and configuration for those studies are not
available, so independent verification is
impossible.
• Claims don't match real-world observations.
But...
• There is an inherent “Home Team Advantage” in
any benchmark comparison.
• Without open source code, any benchmark
claims are essentially marketing fluff, and should
be taken with a grain or two of NaCl.
• Any benchmark claim should be independently
verifiable.
Spark Streaming Paper
• Compares Spark Streaming (Micro-Batch) to
Core Storm (One-at-a-Time)
• A more appropriate comparison would have
been with Storm’s Trident (Micro-Batch) API
• Trident mentioned only in passing (on pages 3
and 12)
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
Spark Streaming Paper
• Benchmark code/configuration not publicly
available
• Performance claims not independently verifiable
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
Spark Streaming Paper
• Granted, the Spark Streaming paper is almost 2
years old and written at a time when Trident was
relatively new.
• However, that paper is often cited when
comparing Apache Storm and Spark Streaming,
particularly in terms of performance.
• A lot can change in 2 years.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
Streaming and batch
processing are
fundamentally different.
Batch vs. Streaming
• Storm is a stream processing framework that
also does micro-batching (Trident).

• Spark is a batch processing framework that also
does micro-batching (Spark Streaming).
Batch vs. Streaming
[Diagram: Batch, Streaming, Micro-Batch]
Apache Storm: Two
Streaming APIs
Core Storm (Spouts and Bolts)
• One at a Time
• Lower Latency
• Operates on Tuple Streams
Trident (Streams and Operations)
• Micro-Batch
• Higher Throughput
• Operates on Streams of Tuple Batches and Partitions
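The difference between the two APIs can be sketched in a few lines of plain Python. This is only an illustration of the processing styles, not Storm code: the function names (`process_one_at_a_time`, `micro_batches`, `process_micro_batch`) and the toy integer stream are assumptions made up for this example.

```python
from typing import Callable, Iterable, Iterator, List


def process_one_at_a_time(stream: Iterable[int],
                          handle: Callable[[int], int]) -> Iterator[int]:
    """One-at-a-time style: each tuple is handled as soon as it arrives."""
    for item in stream:
        yield handle(item)


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Micro-batch style: group incoming tuples into small batches."""
    batch: List[int] = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch


def process_micro_batch(stream: Iterable[int], batch_size: int,
                        handle_batch: Callable[[List[int]], int]) -> Iterator[int]:
    """Apply one operation per batch instead of one per tuple."""
    for batch in micro_batches(stream, batch_size):
        yield handle_batch(batch)


events = list(range(1, 8))  # hypothetical stream of seven tuples
per_tuple = list(process_one_at_a_time(events, lambda x: x * 2))
per_batch = list(process_micro_batch(events, 3, sum))
```

In the micro-batch version, a tuple waits until its batch is full before anything happens to it, which is where the extra latency comes from. The per-batch call amortizes overhead, which is where the extra throughput comes from.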
Language Options
• Core Storm: Java, Clojure, Scala, Python, Ruby, others*
• Storm Trident: Java, Clojure, Scala
• Spark Streaming: Java, Scala, Python
*Storm’s Multi-Lang feature allows the use of virtually any programming language.
Reliability Models
                Core Storm   Storm Trident   Spark Streaming
At Most Once    Yes          Yes             No
At Least Once   Yes          Yes             No*
Exactly Once    No           Yes             Yes*
*In some node failure scenarios, Spark Streaming
falls back to at-least-once processing or data loss.
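Programming Model
Stream Primitive
• Core Storm: Tuple
• Storm Trident: Tuple, Tuple Batch, Partition
• Spark Streaming: DStream
Stream Source
• Core Storm: Spouts
• Storm Trident: Spouts, Trident Spouts
• Spark Streaming: HDFS, Network
Computation/Transformation
• Core Storm: Bolts
• Storm Trident: Filters, Functions, Aggregations, Joins
• Spark Streaming: Transformation, Window Operations
Stateful Operations
• Core Storm: No (roll your own)
• Storm Trident: Yes
• Spark Streaming: Yes
Output/Persistence
• Core Storm: Bolts
• Storm Trident: State, MapState
• Spark Streaming: foreachRDD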
Production Deployments
• Apache Storm: Too many to list
http://storm.incubator.apache.org/documentation/Powered-By.html
• Spark Streaming: Sharethrough
http://engineering.sharethrough.com/blog/2014/06/27/sharethrough-at-spark-summit-2014-spark-streaming-for-realtime-auctions/
Support
Hadoop Distro
• Apache Storm: Hortonworks, MapR
• Spark: Cloudera, MapR, Hortonworks (preview)
• Spark Streaming: Hortonworks, Cloudera, MapR
Resource Management
• Apache Storm: YARN, Mesos
• Spark: YARN, Mesos
• Spark Streaming: YARN*, Mesos
Provisioning/Monitoring
• Apache Storm: Apache Ambari
• Spark: Cloudera Manager
• Spark Streaming: ?
*With issues: http://spark-summit.org/wp-content/uploads/2014/07/Productionizing-a-247-Spark-Streaming-Service-on-YARN-Ooyala.pdf
Failure Scenarios
Worker Failure:
Spark Streaming
"So if a worker node fails, then the system can recompute
the lost from the the left over copy of the input data.
However, if the worker node where a network receiver was
running fails, then a tiny bit of data may be lost, that is, the
data received by the system but not yet replicated to other
node(s)."
Only HDFS-backed data sources are fully fault tolerant.
https://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-properties
Worker Failure:
Spark Streaming
https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-zero-data-loss-in-spark-streaming.html
Solution?: Write Ahead Logs (SPARK-3129)
• Enabling WAL requires DFS (HDFS, S3) — no such
requirement with Storm
• Incurs a performance penalty that adds to overall latency
• Full fault tolerance still requires a data source that can
replay data (e.g. Kafka)
• Architectural band aid?
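The write-ahead log idea, and where its latency cost comes from, can be sketched as follows. This is not Spark's implementation: `WriteAheadReceiver` is a hypothetical class, and a local temporary file stands in for the distributed file system (HDFS, S3) that the real feature requires.

```python
import json
import os
import tempfile


class WriteAheadReceiver:
    """Hypothetical receiver that logs every record before buffering it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.buffer = []  # in-memory data, lost if the worker dies

    def receive(self, record):
        # Persist first, then buffer. This synchronous write is the
        # performance penalty that adds to overall latency.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.buffer.append(record)

    @classmethod
    def recover(cls, log_path):
        """Rebuild the in-memory buffer from the log after a failure."""
        receiver = cls(log_path)
        if os.path.exists(log_path):
            with open(log_path, encoding="utf-8") as log:
                receiver.buffer = [json.loads(line) for line in log if line.strip()]
        return receiver


log_path = os.path.join(tempfile.mkdtemp(), "receiver.wal")
receiver = WriteAheadReceiver(log_path)
for record in ({"id": 1}, {"id": 2}):
    receiver.receive(record)

del receiver  # simulate the worker crashing
recovered = WriteAheadReceiver.recover(log_path)
```

The log only protects records that reached `receive`. Anything the source sent that never arrived still has to be replayed by the source, which is why a replayable source such as Kafka is still needed for full fault tolerance.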
Worker Failure:
Apache Storm
• If a supervisor node fails, Nimbus will reassign that node's
tasks to other nodes in the cluster.
• Any tuples sent to a failed node will time out and be
replayed (In Trident, any batches will be replayed).
• Delivery guarantees dependent on a reliable data source.
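The timeout-and-replay behavior can be sketched with a toy simulation. This is not the Storm API: `ReplayingSpout`, `run`, the logical clock, and the timeout of 3 ticks are all assumptions made up for this illustration.

```python
class ReplayingSpout:
    """Hypothetical source that replays any message not acked in time."""

    def __init__(self, messages, timeout):
        self.queue = list(messages)  # messages not yet emitted
        self.pending = {}            # msg_id -> (message, emit_time)
        self.timeout = timeout
        self.next_id = 0

    def next_tuple(self, now):
        # Timed-out messages are replayed before new ones are emitted.
        for msg_id, (message, emitted_at) in list(self.pending.items()):
            if now - emitted_at >= self.timeout:
                self.pending[msg_id] = (message, now)
                return msg_id, message
        if self.queue:
            message = self.queue.pop(0)
            msg_id = self.next_id
            self.next_id += 1
            self.pending[msg_id] = (message, now)
            return msg_id, message
        return None

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)


def run(messages, fail_first_delivery_of, timeout=3, max_ticks=50):
    """Deliver messages; the first delivery of some is lost to a failed node."""
    spout = ReplayingSpout(messages, timeout)
    failed_once = set()
    processed = []
    for now in range(max_ticks):
        emitted = spout.next_tuple(now)
        if emitted is None:
            if not spout.pending:
                break  # everything was emitted and acked
            continue   # wait for a pending message to time out
        msg_id, message = emitted
        if message in fail_first_delivery_of and message not in failed_once:
            failed_once.add(message)  # tuple lost on the failed node: no ack
            continue
        processed.append(message)
        spout.ack(msg_id)
    return processed


processed = run(["a", "b", "c"], fail_first_delivery_of={"b"})
```

Every message is eventually processed, but the replayed one arrives out of order. Replay is also only possible because the source kept the unacked message, which is the sense in which the guarantee depends on a reliable data source.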
Data Source Reliability
• A data source is considered unreliable if there is no means
to replay a previously-received message.
• A data source is considered reliable if it can somehow replay
a message if processing fails at any point.
• A data source is considered durable if it can replay any
message or set of messages given the necessary selection
criteria.
(These are my terms.)
Reliability Limitations:
Apache Storm
• Exactly once processing requires a durable data source.
• At least once processing requires a reliable data source.
• An unreliable data source can be wrapped to provide
additional guarantees.
• With durable and reliable sources, Storm will not drop data.
• Common pattern: Back unreliable data sources with
Apache Kafka (minor latency hit traded for 100% durability).
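The "back it with Kafka" pattern can be sketched without Kafka itself. Both classes below are hypothetical: `UnreliableSource` hands out each message once and cannot replay it, and `DurableLog` is an in-memory stand-in for an offset-addressed log such as a Kafka topic partition.

```python
class UnreliableSource:
    """Hands out each message once; nothing can be replayed."""

    def __init__(self, messages):
        self._messages = iter(messages)

    def poll(self):
        return next(self._messages, None)


class DurableLog:
    """Append-only log addressed by offset (in-memory stand-in for Kafka)."""

    def __init__(self):
        self._entries = []

    def append(self, message):
        self._entries.append(message)
        return len(self._entries) - 1  # offset of the new entry

    def read_from(self, offset):
        return list(self._entries[offset:])


def drain(source, log):
    """Copy everything from the unreliable source into the durable log."""
    count = 0
    while (message := source.poll()) is not None:
        log.append(message)
        count += 1
    return count


source = UnreliableSource(["t1", "t2", "t3"])
log = DurableLog()
copied = drain(source, log)

replay_all = log.read_from(0)   # any range can be re-read by offset
replay_tail = log.read_from(1)
gone = source.poll()            # the original source has nothing left to give
```

The extra hop through the log is the minor latency hit. In exchange, a consumer can re-read any range of messages as often as it needs, which is what makes the source durable in the terms defined above.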
Apache Storm Spouts
• Durable: Kafka
• Reliable: JMS, RabbitMQ / AMQP, Kestrel, Amazon SQS, Amazon Kinesis
• Unreliable: Twitter, Scribe, MongoDB
Apache Storm Output
(Bolts, Trident State)
• Cassandra
• HBase
• HDFS
• Kafka
• Redis
• Memcached
• R
• JMS
• MongoDB
• RDBMS
Apache Storm + Kafka
Apache Kafka is an ideal source for Storm topologies. It
provides everything necessary for:
• At most once processing
• At least once processing
• Exactly once processing
Apache Storm includes Kafka spout implementations for all
levels of reliability.
Kafka supports a wide variety of languages and integration
points for both producers and consumers.
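One way to see why a replayable, offset-addressed source makes exactly-once processing possible: if a replayed batch always carries the same transaction id and the same contents, the state can record the last id it applied and skip duplicates. The sketch below is a simplified illustration of that idea, not Trident's actual State API. `TransactionalCounter` and the batch data are hypothetical, and batches are assumed to be applied in order.

```python
class TransactionalCounter:
    """Hypothetical state that stores the last applied txid with its value."""

    def __init__(self):
        self.count = 0
        self.last_txid = None

    def apply_batch(self, txid, batch):
        # A replayed batch carries the txid already recorded: skip it.
        if txid == self.last_txid:
            return False
        self.count += len(batch)
        self.last_txid = txid
        return True


# Batch 2 is delivered twice, as it would be after a failure and replay.
deliveries = [
    (1, ["a", "b"]),
    (2, ["c"]),
    (2, ["c"]),
    (3, ["d", "e", "f"]),
]

state = TransactionalCounter()
applied = [state.apply_batch(txid, batch) for txid, batch in deliveries]
naive_count = sum(len(batch) for _, batch in deliveries)
```

Delivery is still at-least-once (batch 2 arrives twice). It is the state update that is applied exactly once, so `state.count` reflects six messages while the naive count would report seven.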
Reliability Limitations:
Spark Streaming
• Fault tolerance and reliability guarantees require
HDFS-backed data source.
• Moving data to HDFS prior to stream processing
introduces additional latency.
• Network data sources (Kafka, etc.) are
vulnerable to data loss in the event of a worker
node failure.
https://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-properties
Performance
“The main reason cited by Tathagata for Spark's
performance gain over Storm is the aggregation of
small records that occurs through the mechanics
of RDDs.”
http://www.cs.duke.edu/~kmoses/cps516/dstream.html
In other words: Micro-Batching
Performance
http://www.cs.duke.edu/~kmoses/cps516/dstream.html
Storm capped at 10k msgs/sec/node?
Spark Streaming 40x faster than Storm?
Others may disagree…
https://twitter.com/nathanmarz/status/207989068519317505
http://www.slideshare.net/JamesSirota/cisco-opensoc
Netty Transport
• Introduced in Apache Storm 0.9.0
• Faster, pure Java alternative to 0MQ
• Yahoo! Engineering announcement:
http://yahooeng.tumblr.com/post/64758709722/making-storm-fly-with-netty
• Performance Test Code:
https://github.com/yahoo/storm-perf-test
[Chart legend: Netty, 0mq]
STORM-297
• Introduced in Apache Storm 0.9.2-incubating
• Big performance boost, especially for small messages
• JIRA Discussion:
https://issues.apache.org/jira/browse/STORM-297
• Performance Test Code:
https://github.com/yahoo/storm-perf-test
Benchmarking Storm
• 5 nodes on AWS (m1.large - not very powerful)
• 1 ZooKeeper, 1 Nimbus, 3 Supervisors
• Storm Core API and Trident API benchmarks
• Is Trident API slower than Core API?
https://github.com/ptgoetz/storm-benchmark
Is Trident API slower than
Core API?
• On low-power hardware with 3 supervisor nodes…
• Core API:
~150k msg./sec. with ~80 ms. latency
• Trident API:
~300k msg./sec. with ~250 ms. latency
• Higher throughput possible with increased latency
• Better performance with bigger hardware
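For a rough comparison with the "10k msgs/sec/node" figure questioned earlier, the numbers above can be turned into per-supervisor rates. This is back-of-the-envelope arithmetic only: it assumes the load is spread evenly over the 3 supervisor nodes, and the two setups are not otherwise comparable.

```python
SUPERVISORS = 3        # worker nodes in the benchmark cluster
CLAIMED_CAP = 10_000   # msgs/sec/node figure questioned earlier

# Approximate cluster-wide results: (msgs/sec, latency in ms)
results = {
    "Core API": (150_000, 80),
    "Trident API": (300_000, 250),
}

per_node = {
    name: throughput / SUPERVISORS
    for name, (throughput, _latency_ms) in results.items()
}
times_claimed_cap = {
    name: rate / CLAIMED_CAP for name, rate in per_node.items()
}

for name, rate in per_node.items():
    print(f"{name}: ~{rate:,.0f} msgs/sec/node "
          f"({times_claimed_cap[name]:.0f}x the claimed cap)")
```

Even on low-power hardware, that works out to roughly 50k msgs/sec/node for the Core API and 100k for Trident.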
Is Spark + Spark Streaming a
"Lambda Architecture in a Box?"
• No!
• Lambda is a lot more than batch + streaming.
• Lambda is powerful when applied correctly, but is
not right for every use case.
• Spark and Spark Streaming have overlapping
programming models for batch and micro-batch.
• The rest is up to you (as it is with Storm).
Final Thoughts
In general (not specific to Spark Streaming):
• Beware any claim that A is X times faster than B.
• Performance is a matter of proper tuning for the
use case at hand.
• Any system can be hobbled to look bad in a
benchmark.
Recommendation
• It is up to you, and your specific use case.
• Consider fault tolerance. Is data loss
acceptable?
• Consider all facets and make informed
decisions.
• Rely on your own benchmarks.
Questions?
Thank you!
