REAL-TIME ANALYTICS WITH KAFKA, CASSANDRA & STORM
Dr. John Georgiadis
Modio Computing
USE CASES
• Collecting/processing measurements from large
sensor networks (e.g. weather data).
• Aggregated processing of financial trading streams.
• Customer activity monitoring for advertising
purposes, fraud detection, etc.
• Real-time security log processing.
SOLUTION APPROACH
• Real-time Updates: Employ streaming instead of batch analytics.
• Apache Storm: Large installation base. Streaming & micro-batch.
• Apache Spark: Uniform API for batch & micro-batch. Runs on top of
YARN/HDFS. Micro-batch is less mature but catching up quickly.
• Large data sets + Time Series + Write-Intensive + Data Expiration =
Apache Cassandra
ARCHITECTURE
APACHE KAFKA
• N Nodes
• T Topics
• Replication Factor: Defines high availability
• Partitions: They define the parallelism level. A single consumer per
partition.
• Consumer discovers cluster nodes through Zookeeper
• Consumer partition state is just an integer: the partition offset.
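The last bullet can be illustrated with a toy in-memory model. This is a sketch, not the Kafka client API: `PartitionedLog`, `Consumer` and the byte-sum partitioner are hypothetical names and simplifications chosen for illustration. Each partition is an append-only list, and everything a consumer needs to remember per partition is one integer.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionedLog:
    """Toy stand-in for a topic: one append-only list per partition."""
    num_partitions: int
    partitions: list = field(default_factory=list)

    def __post_init__(self):
        self.partitions = [[] for _ in range(self.num_partitions)]

    def append(self, key: str, value: str) -> int:
        # Simplified partitioner (an assumption): byte sum of the key
        # modulo the partition count.
        p = sum(key.encode()) % self.num_partitions
        self.partitions[p].append(value)
        return p


@dataclass
class Consumer:
    """Reads one partition; its only state is the next offset to read."""
    log: PartitionedLog
    partition: int
    offset: int = 0

    def poll(self, max_records: int = 10) -> list:
        end = self.offset + max_records
        records = self.log.partitions[self.partition][self.offset:end]
        self.offset += len(records)
        return records


log = PartitionedLog(num_partitions=2)
for i in range(6):
    log.append(key=f"sensor-{i}", value=f"reading-{i}")

# One consumer per partition, as on the slide.
consumers = [Consumer(log, p) for p in range(log.num_partitions)]
consumed = [c.poll() for c in consumers]
```

Because the state is a single integer, a restarted consumer can resume simply by being constructed with the last stored offset.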
STORM
• Storm is a distributed computing platform. In Storm a distributed
computation is a directed graph of interconnected processors (topology)
that exchange messages.
• Spouts: Processors that inject messages into the topology.
• Bolts: Processors that process messages, including sending to 3rd parties
(e.g. persistence).
• Trident: High-level operations on message batches. Supports batch replay
in case of failure. Translates to a graph of low-level spouts and bolts.
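The spout/bolt graph can be sketched with plain Python generators. This is only a conceptual model, assuming a single process and no Storm API; the function names are hypothetical.

```python
from collections import Counter


def sentence_spout():
    """Spout: injects messages into the topology."""
    yield from ["storm kafka", "kafka cassandra", "storm storm"]


def split_bolt(stream):
    """Bolt: turns each incoming message into zero or more messages."""
    for sentence in stream:
        yield from sentence.split()


def count_bolt(stream):
    """Terminal bolt: aggregates; a real bolt might persist to a store here."""
    return Counter(stream)


# The "topology" is the directed graph spout -> split -> count.
word_counts = count_bolt(split_bolt(sentence_spout()))
```

In Storm itself each processor would run as many parallel tasks spread over the cluster, with messages exchanged over the network rather than through in-process iteration.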
STORM :: NIMBUS
• A single controller (Nimbus) where topologies are submitted.
• Nimbus breaks topologies into tasks and forwards them to supervisors, which spawn
one or more workers (processes) per task.
• Nimbus redistributes tasks in case a supervisor fails.
• Nimbus is not HA. If Nimbus fails, running topologies are not
affected.
STORM :: SUPERVISOR
• 1 supervisor per host.
• Supervisor registers with ZooKeeper (ZK) at startup and is thus discoverable
by Nimbus.
• Supervisor spawns Worker JVMs: one process per topology.
• JAR submitted to Nimbus is copied to Worker classpath.
• When a Supervisor dies, all Worker tasks are migrated to the
remaining Supervisors.
STORM :: TRIDENT
• When to use micro-batch (aka Trident) instead of streaming:
• Millisecond latency not required. Typical Trident latency threshold: 500 ms.
• Allows batch-mode persistence operations.
• High-level abstractions: partitionBy, partitionAggregate, stateQuery,
partitionPersist.
• A batch processing timeout/exception will cause a replay of the batch,
provided the spout supports replays (Kafka does): at-least-once semantics.
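Why batch replay gives at-least-once rather than exactly-once delivery can be seen in a small simulation. This is a sketch under stated assumptions: `process_with_replay` is a hypothetical helper, the sink is a plain list standing in for a non-transactional store, and the failure is injected artificially.

```python
def process_with_replay(batches, sink, fail_once_on=None, max_attempts=3):
    """Process each batch; on failure, replay the whole batch.

    Writes made before a failure are NOT rolled back, so a replayed
    batch can write some records twice.
    """
    failed_already = set()
    for batch_id, batch in enumerate(batches):
        for _attempt in range(max_attempts):
            try:
                for i, record in enumerate(batch):
                    # Inject a single failure midway through the chosen batch.
                    if (batch_id == fail_once_on and i == 1
                            and batch_id not in failed_already):
                        failed_already.add(batch_id)
                        raise TimeoutError(f"batch {batch_id} timed out")
                    sink.append(record)
                break  # batch fully processed
            except TimeoutError:
                continue  # replay the same batch from its start
        else:
            raise RuntimeError(f"batch {batch_id} failed {max_attempts} times")


sink = []
process_with_replay([["a", "b"], ["c", "d"]], sink, fail_once_on=1)
```

Record "c" ends up in the sink twice: nothing is lost, but duplicates are possible, so downstream writes should be idempotent or deduplicated.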
STORM :: PARALLELISM
• Parallelism = Number of threads executing a topology cluster-wide.
• Parallelism <= CPU threads/worker x Workers
• Define per-topology max #workers (explicitly) and max parallelism (implicitly).
• Define topology step parallelism explicitly. Max parallelism = Σ(step parallelism).
• Trident merges multiple steps into the same thread/node. The last parallelism statement is the
effective parallelism.
• Repartition operations define step merging boundaries.
• Repartition operations (shuffle, partitionBy, broadcast, etc.) imply network transfer and are
expensive. In some cases they are disastrous to performance!
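The two formulas on this slide can be checked with a few lines of arithmetic. The step names and numbers below are hypothetical, chosen only to show a topology that requests more threads than the workers can supply.

```python
def max_parallelism(step_parallelism: dict) -> int:
    """Max parallelism = sum of the per-step parallelism settings."""
    return sum(step_parallelism.values())


def thread_capacity(cpu_threads_per_worker: int, workers: int) -> int:
    """Upper bound from the slide: CPU threads per worker x workers."""
    return cpu_threads_per_worker * workers


# Hypothetical per-step parallelism settings.
steps = {"kafka-spout": 4, "parse": 8, "aggregate": 4}

requested = max_parallelism(steps)
capacity = thread_capacity(cpu_threads_per_worker=4, workers=3)
oversubscribed = requested > capacity
```

Here the topology asks for 16 threads against a capacity of 12, so either the step parallelism should be lowered or the number of workers raised.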
STORM :: PERFORMANCE TUNING
• Spouts must match upstream parallelism: one spout per Kafka partition.
• Little’s Law: Batch Size = Throughput x Latency
• Adjust batch size: (Kafka Partitions) x (Kafka fetch size)
• Larger batch size = {higher throughput, higher latency}
• Increase batch size gradually until latency starts increasing sharply.
• Identify the slowest stage (I/O or CPU bound):
• You can’t have better throughput than the throughput of your slowest stage.
• You can’t have better latency than the sum of individual latencies.
• If CPU bound, increase parallelism. If I/O bound, increase downstream (i.e. storage) capacity.
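The tuning rules above reduce to three small calculations. The stage names and figures are made up for illustration; only the formulas come from the slide.

```python
def pipeline_throughput(stage_throughputs: dict) -> int:
    """A pipeline is no faster than its slowest stage (records/sec)."""
    return min(stage_throughputs.values())


def pipeline_latency_ms(stage_latencies_ms: dict) -> int:
    """End-to-end latency is at least the sum of stage latencies (ms)."""
    return sum(stage_latencies_ms.values())


def littles_law_batch_size(throughput_per_sec: float, latency_ms: float) -> float:
    """Little's Law: batch size = throughput x latency."""
    return throughput_per_sec * latency_ms / 1000


# Hypothetical measurements per stage.
throughputs = {"spout": 50_000, "aggregate": 20_000, "persist": 8_000}
latencies_ms = {"spout": 50, "aggregate": 100, "persist": 250}

bottleneck = min(throughputs, key=throughputs.get)
batch_size = littles_law_batch_size(
    pipeline_throughput(throughputs), pipeline_latency_ms(latencies_ms)
)
```

With these numbers the persistence stage is the bottleneck (8,000 records/sec), total latency is 400 ms, and Little's Law suggests a batch of about 3,200 records as a starting point for tuning.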
CASSANDRA :: THE GOOD
• Great write performance. Decent read performance.
• Write latency: 20μs-120μs
• 15K writes/sec on a single node
• Extremely stable. Very low record of data corruption.
• Decentralized setup: all cluster nodes have the same setup.
• Multi-datacenter setups.
• Configurable consistency of updates: ONE, QUORUM, ALL.
• TTL per cell (row & column).
• Detailed metrics: #operations, latencies, thread pools, memory, cache performance.
CASSANDRA :: THE “FEATURES”
• All partition key columns must be set in queries.
• If a query sets a value for a primary key column, it must also set values for all primary key
columns that precede it.
• TTL is not allowed on cells containing counters.
• NULL values are not supported on primary keys.
• Range queries can only be applied on the last column of the composite primary key that
appears in the query.
• Disjunction operator (OR) is not available. The IN keyword can be used in some cases instead.
• Row counting is a very expensive operation.
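The key-related rules above can be expressed as a small checker. This is a simplified model, not Cassandra's actual query validation: `check_query` is a hypothetical function, and it ignores IN, ALLOW FILTERING and secondary indexes.

```python
def check_query(partition_key, clustering_key, restrictions):
    """Return a list of rule violations for a WHERE clause.

    `restrictions` maps column name -> "eq" or "range".
    """
    problems = []
    for col in partition_key:
        if restrictions.get(col) != "eq":
            problems.append(f"partition key column '{col}' must be set")
    seen_gap = False
    seen_range = False
    for col in clustering_key:
        op = restrictions.get(col)
        if op is None:
            seen_gap = True
            continue
        if seen_gap:
            problems.append(f"'{col}' is set but a preceding key column is not")
        if seen_range:
            problems.append(f"'{col}' follows a range restriction")
        if op == "range":
            seen_range = True
    return problems


# Hypothetical time-series table: PRIMARY KEY ((sensor_id), day, ts)
pk, ck = ["sensor_id"], ["day", "ts"]

ok = check_query(pk, ck, {"sensor_id": "eq", "day": "eq", "ts": "range"})
missing_pk = check_query(pk, ck, {"day": "eq"})
gap = check_query(pk, ck, {"sensor_id": "eq", "ts": "range"})
range_not_last = check_query(pk, ck, {"sensor_id": "eq", "day": "range", "ts": "eq"})
```

Only the first query passes; each of the other three breaks exactly one of the rules listed on the slide.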
CASSANDRA :: PERFORMANCE
• Design the schema around the partition key.
• Keep each partition small (no more than a few hundred entries), as reading will fetch the whole partition.
• Leverage the key cache.
• Avoid making time fragments part of the partition key, as this will direct all activity to the node that is the
partition owner at a given date.
• Query/Update Plan:
• Avoid range queries and the IN operator, as they require contacting multiple nodes and assembling the
results at the coordinator node.
• Use prepared statements to avoid repeated statement parsing.
• Prefer async writes combined with a max pending statements threshold.
• Best performance out of batches containing statements with the same partition key.
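The "async writes with a max pending threshold" pattern can be sketched with a semaphore. This is not the Cassandra driver API: `fake_write` is a stand-in for an asynchronous driver call, and the thread pool, threshold and row values are assumptions made for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def fake_write(row, store, lock):
    """Stand-in for an asynchronous write to the database."""
    with lock:
        store.append(row)


def write_all(rows, max_pending=4):
    """Submit writes asynchronously, never exceeding max_pending in flight."""
    store = []
    lock = threading.Lock()
    slots = threading.Semaphore(max_pending)
    stats = {"in_flight": 0, "peak": 0}

    def on_done(_future):
        with lock:
            stats["in_flight"] -= 1
        slots.release()  # free a slot so the producer can submit again

    with ThreadPoolExecutor(max_workers=8) as pool:
        for row in rows:
            slots.acquire()  # blocks once max_pending writes are in flight
            with lock:
                stats["in_flight"] += 1
                stats["peak"] = max(stats["peak"], stats["in_flight"])
            pool.submit(fake_write, row, store, lock).add_done_callback(on_done)
    return store, stats["peak"]


stored, peak = write_all(range(20))
```

The semaphore applies back-pressure to the producer, which keeps memory bounded and avoids overwhelming the cluster while still overlapping many writes.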
CASSANDRA :: CLUSTER
• One or more data nodes are also “seed” nodes acting as the membership gatekeepers.
• Table sharding across the cluster based on the partition key hash (token).
• Table replication according to replication factor (RF). Configurable per keyspace (database).
• The Java driver has several load balancing approaches:
• Token-aware: sends each statement to the node that actually will store it. Random selection
amongst the nodes for a given replica set.
• Latency-aware: sends each statement to the node with the fastest response.
• Round-robin & custom load balancers supported.
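Token-based sharding and token-aware routing can be sketched with a toy hash ring. Several simplifications are assumed: MD5 is used only to keep the example stdlib-only (Cassandra's default partitioner is Murmur3), each node owns a single token (no virtual nodes), and `Ring` is a hypothetical class rather than a driver type.

```python
import bisect
import hashlib
import random


def token(value: str) -> int:
    # MD5 is an assumption for this sketch, not Cassandra's default hash.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, replication_factor):
        # One token per node, sorted to form the ring.
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]
        self.rf = replication_factor

    def replicas(self, partition_key: str) -> list:
        """Owner is the first node at or after the key's token,
        followed by the next RF-1 nodes around the ring."""
        start = bisect.bisect_left(self.tokens, token(partition_key))
        size = len(self.ring)
        return [self.ring[(start + i) % size][1] for i in range(self.rf)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("sensor-42")

# Token-aware routing: pick a coordinator at random among the replicas.
coordinator = random.choice(owners)
```

Sending the statement straight to a replica saves the extra network hop that a non-replica coordinator would need.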
FAILURE SCENARIOS
• Kafka
• N-way replication: tolerates N-1 node failures.
• Clients are dynamically reconfigured if accessing through ZooKeeper.
• Storm
• For cluster size N, tolerates X supervisor failures, provided the remaining (N-X) nodes have memory
to accommodate X JVM worker processes.
• Incomplete batches are replayed: at-least-once semantics.
• Cassandra
• N-way replication: tolerates N-1 node failures if using ONE consistency, fewer if using QUORUM consistency.
• Clients require a list of all cluster nodes.
• ZooKeeper
• Majority voting is required: at most F failures in a cluster with 2F+1 nodes. Leader re-election is very fast (200 ms).
• Sizes bigger than 3-5 are not recommended due to decreasing write performance.
• If majority voting is lost, Storm will stop and Kafka will fail to commit client offsets. If the majority is regained, Storm will resume. Kafka
brokers will resume in most cases.
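The majority arithmetic behind these failure limits can be worked through in a few lines. The QUORUM figure is computed from the standard majority definition (more than half of the replicas) and is not a number taken from the slide; the replication factor of 3 is an assumed example.

```python
def quorum(n: int) -> int:
    """Smallest majority of n nodes or replicas."""
    return n // 2 + 1


def tolerated_failures(n: int, required: int) -> int:
    """How many of n nodes can fail while `required` stay reachable."""
    return max(n - required, 0)


# ZooKeeper: an ensemble of 2F+1 nodes tolerates F failures.
zk = {n: tolerated_failures(n, quorum(n)) for n in (3, 5, 7)}

# Cassandra with an assumed replication factor of 3.
rf = 3
cassandra = {
    "ONE": tolerated_failures(rf, 1),
    "QUORUM": tolerated_failures(rf, quorum(rf)),
    "ALL": tolerated_failures(rf, rf),
}
```

For RF = 3 this gives 2 tolerated replica failures at ONE, 1 at QUORUM and 0 at ALL, which is the usual trade-off between availability and consistency.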
