Architectural Patterns for Streaming Applications
Strata+Hadoop World, Singapore – December 02, 2015
tiny.cloudera.com/streaming-singapore
tiny.cloudera.com/streaming-singapore-questions
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska
2
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
Questions? tiny.cloudera.com/streaming-singapore-questions
3
About the presenters
Ted Malaska
• Principal Solutions Architect at Cloudera
• Done Hadoop for 6 years
– Worked with > 70 companies in 8 countries
• Previously, lead architect at FINRA
• Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark
• Marvel fan boy, runner
Mark Grover
• Software Engineer at Cloudera, working on Spark
• Committer on Apache Bigtop, PMC member on Apache Sentry (incubating)
• Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume
4
Goal
5
Understand common use-cases for streaming and their architectures
6
What is streaming?
7
When to stream, and when not to
• We are looking for an SLA sweet spot
• Latency from multiple milliseconds up to seconds
• Not minutes
• Not constant low-millisecond or sub-millisecond latency
• Streaming doesn’t come for free
8
Use-cases for streaming
9
Use-case categories
• Ingestion
– Transformation
– Decision (e.g. Anomaly detection)
• Simple counts
– Lambda, etc.
• Advanced usage
– Machine Learning
– Windowing
10
Ingestion
11
What is ingestion?
(Diagram: source systems -> ingest -> destination system)
12
But there are multiple sources
(Diagram: source systems 1, 2 and 3 -> ingest -> destination system)
13
But...
• Sources, sinks and ingestion channels may go down
• Sources and sinks may be producing and consuming at different rates
• Regular maintenance windows may need to be scheduled
• So we need a resilient message broker between them
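The decoupling a broker provides can be sketched in a few lines. This is a minimal in-memory illustration, not Kafka's API: `ToyLog`, `BrokerDemo` and the event names are hypothetical. It assumes an append-only log from which each consumer pulls at its own pace, using an offset it tracks itself.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for a broker topic: an append-only, in-memory log.
final class ToyLog[A] {
  private val records = ArrayBuffer.empty[A]

  // Producers append and get back the offset of the new record.
  def append(record: A): Long = { records += record; records.size - 1L }

  // Consumers pull up to `max` records starting at an offset they track themselves.
  def poll(fromOffset: Long, max: Int): Seq[A] =
    records.slice(fromOffset.toInt, fromOffset.toInt + max).toSeq
}

object BrokerDemo {
  def run(): (Seq[String], Seq[String]) = {
    val log = new ToyLog[String]
    (1 to 5).foreach(i => log.append(s"event-$i")) // a fast producer writes ahead

    var offset = 0L
    val first = log.poll(offset, 2)                // a slow consumer reads only two
    offset += first.size

    log.append("event-6")                          // producer continues while the consumer is down
    val second = log.poll(offset, 10)              // consumer resumes where it left off
    (first, second)
  }
}
```

Because the log retains records and the consumer owns its offset, neither a rate mismatch nor a consumer outage loses data in this sketch.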
14
Need for a message broker
(Diagram: source systems 1, 2 and 3 -> ingest -> message broker -> extract -> destination system)
15
Kafka
(Diagram: source systems 1, 2 and 3 -> ingest -> Kafka as the message broker -> extract -> destination system)
16
But a ‘queue’ doesn’t ‘push’
(Diagram: source systems 1, 2 and 3 -> ingest -> message broker -> extract -> streaming ingestion process -> push -> storage system)
17
Streaming data ingestion process
(Diagram: source systems 1, 2 and 3 -> ingest -> message broker -> extract -> streaming ingestion process (Kafka Connect or Apache Flume) -> push -> storage system)
18
Streaming architecture for ingestion
(Diagram: source systems 1, 2 and 3 -> ingest -> message broker -> extract -> streaming ingestion process (Kafka Connect or Apache Flume) -> push -> storage system)
19
Transforming data in flight
20
Streaming architecture for ingestion
(Diagram: source systems 1, 2 and 3 -> ingest -> message broker -> extract -> streaming ingestion process (Kafka Connect or Apache Flume) -> push -> storage system)
21
Streaming architecture for ingestion
(Diagram: source systems 1, 2 and 3 -> ingest -> message broker -> extract -> streaming ingestion process (Kafka Connect or Apache Flume) -> push -> storage system. Callout on the streaming ingestion process: can be used to do simple transformations)
22
Two types of transformations
Atomic
• Need to work with one event at a time
• Example: mask a credit card number
With context
• Need to refer to external context
• Example: convert zip code to state, by looking up a cache
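A minimal sketch of the two kinds of transformation, in plain Scala. The names `Transformations`, `maskCard`, `zipToState` and `stateFor` are hypothetical, and the in-memory `Map` merely stands in for whatever cache a real pipeline would consult.

```scala
object Transformations {
  // Atomic: only the event itself is needed.
  // Keeps the last four digits and masks the rest.
  def maskCard(cardNumber: String): String = {
    val digits = cardNumber.filter(_.isDigit)
    "*" * math.max(0, digits.length - 4) + digits.takeRight(4)
  }

  // With context: the event is enriched from an external lookup.
  // Assumption: a tiny in-memory map plays the role of the cache.
  val zipToState: Map[String, String] = Map("94105" -> "CA", "10001" -> "NY")

  def stateFor(zip: String): Option[String] = zipToState.get(zip)
}
```

The atomic function can run anywhere an event passes through; the contextual one needs the lookup data to be reachable, and reasonably fresh, from wherever it runs.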
23
Atomic transformations
• Require no context
• Can be done simply within Flume interceptors, Kafka Connect or Spark Streaming
24
Flume Interceptors
• Mask fields
• Validate information against external source
• Extract fields
• Modify data format
• Filter or split events
25
Streaming architecture for ingestion
(Diagram: source systems 1, 2 and 3 -> ingest -> message broker -> extract -> streaming ingestion process (Kafka Connect or Apache Flume) -> push -> storage system. Callout on the streaming ingestion process: can be used to do simple transformations)
26
Transformations with context
27
Exactly once, at least once, at most once
(In the context of data ingestion)
28
Streaming architecture for ingestion
(Diagram: source systems 1, 2 and 3 -> ingest -> message broker -> extract -> streaming ingestion process (Copycat, the earlier name for Kafka Connect, or Apache Flume) -> push -> storage system. Callout on the streaming ingestion process: can be used to do simple transformations)
29
Semantic types
• At most once
– Not good for many cases
– Only where performance/SLA is more important than accuracy
• Exactly once
– Expensive to achieve but desirable
• At least once
– Easiest to achieve
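One way to see where these semantics come from is the order in which a consumer commits its position and processes a record. The sketch below is a toy simulation under stated assumptions: `DeliverySemantics`, `consume`, `crashAt` and `commitFirst` are hypothetical names, and the consumer crashes exactly once, between the two steps, while handling the record at index `crashAt`.

```scala
object DeliverySemantics {
  // Returns what ends up in the sink after one crash and restart.
  def consume(records: Seq[String], crashAt: Int, commitFirst: Boolean): Vector[String] = {
    var committed = 0                 // next offset to read after a restart
    var sink = Vector.empty[String]
    var crashed = false

    while (committed < records.length) {          // each pass is one consumer (re)start
      var offset = committed
      var alive = true
      while (alive && offset < records.length) {
        val crashNow = offset == crashAt && !crashed
        if (commitFirst) {
          committed = offset + 1                  // step 1: commit
          if (crashNow) { crashed = true; alive = false } // crash before processing
          else sink :+= records(offset)           // step 2: process
        } else {
          sink :+= records(offset)                // step 1: process
          if (crashNow) { crashed = true; alive = false } // crash before committing
          else committed = offset + 1             // step 2: commit
        }
        offset += 1
      }
    }
    sink
  }
}
```

Committing first loses the in-flight record (at most once); processing first replays it (at least once). Exactly once then requires the sink to absorb the replay.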
30
Categories of storage systems
“Puts” based
• Records can be re-inserted without side effects, since a re-inserted record has the same key and overwrites the earlier copy
“Appends” based
• Records cannot be re-inserted without creating duplicates
31
How to achieve exactly once?
• For “puts” based storage systems
– At least once is enough (keys have to be unique though, i.e. a primary key)
– Re-inserted records will have duplicate keys
– They will simply overwrite the existing record with the same value
• For “appends” based storage systems (e.g. HDFS)
– Still easiest to do at least once
– Need to de-duplicate before processing
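The difference between the two storage categories can be sketched with an at-least-once delivery in which one record arrives twice. Everything here is illustrative: `ExactlyOnceSketch`, `Event` and the sample data are hypothetical, an immutable `Map` stands in for a puts-based store and a `Vector` for an appends-based one.

```scala
object ExactlyOnceSketch {
  final case class Event(id: String, amount: Int)

  // At-least-once delivery: "e2" is replayed after a failure.
  val delivered: Seq[Event] =
    Seq(Event("e1", 10), Event("e2", 20), Event("e2", 20), Event("e3", 5))

  // "Puts" based: a keyed upsert, so the replay overwrites the same key with the same value.
  def putAll(events: Seq[Event]): Map[String, Event] =
    events.foldLeft(Map.empty[String, Event])((store, e) => store.updated(e.id, e))

  // "Appends" based: every delivery becomes a new row, duplicates included.
  def appendAll(events: Seq[Event]): Vector[Event] = events.toVector

  // De-duplicate appended rows by key before processing them.
  def dedupe(rows: Seq[Event]): Seq[Event] =
    rows.groupBy(_.id).values.map(_.head).toSeq.sortBy(_.id)
}
```

Summing `amount` over the puts-based store or over the de-duplicated rows gives the same total; summing over the raw appended rows counts the replayed event twice.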
32
Anomaly detection systems
33
(Diagram: anomaly detection architecture across two Hadoop clusters. Hadoop Cluster I, NRT/stream processing: Flume sources with rules interceptors, Flume channel, Kafka (events, alerts/events), Spark Streaming (adjusting NRT stats), HBase and/or memory store (fetching and updating profiles/rules). Hadoop Cluster II, storage and batch processing: Flume sink, HDFS, HBase, Impala, Map/Reduce, Spark (reporting; automated and manual analytical adjustments and pattern detection; batch time adjustments))
34
Counting
35
Streaming and Counting
• Counting is easy, right?
• Back to "only once" (exactly-once) semantics
36
We started with Lambda
(Diagram: Lambda architecture. Labels: pipe, speed layer, batch layer, persist results, speed results, batch results, serving layer)
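As a rough sketch of what the serving layer does in a Lambda architecture, the function below merges a batch view with a speed view of the same counts. `LambdaSketch` and `serve` are hypothetical names, and the sketch assumes the speed view only covers events the batch layer has not yet processed.

```scala
object LambdaSketch {
  // Combine per-key counts from the batch results and the speed results.
  def serve(batchView: Map[String, Long], speedView: Map[String, Long]): Map[String, Long] =
    (batchView.keySet ++ speedView.keySet)
      .map(key => key -> (batchView.getOrElse(key, 0L) + speedView.getOrElse(key, 0L)))
      .toMap
}
```

If the two views overlap, or the speed layer double-counts, the merged answer is wrong until the next batch run corrects it.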
37
Why did streaming suck?
• Increments with Cassandra
– Double increment
– No strong consistency
• Storm without Kafka
– Not only once
– Not at least once
• Batch would have to re-process EVERY record to remove dups
38
We have come a long way
• We don’t have to use increments any more and we can have consistency
– HBase
• We can have state in our streaming platform
– Spark Streaming
• We don’t lose data
– Spark Streaming
– Kafka
– Other options
• Full universe of de-duping
– Again, HBase with versions
39
Increments
40
Puts with State
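The contrast between increments and puts with state can be sketched as follows. This is a toy model, not HBase or Spark Streaming code: `CountingSketch`, `withIncrements`, `withPuts` and the batch data are hypothetical, and the `checkpoints` map merely imitates a streaming engine restoring its state before recomputing a failed micro-batch.

```scala
object CountingSketch {
  // (batchId, words). Batch 2 is delivered twice because of a replay.
  val batches: Seq[(Int, Seq[String])] =
    Seq(1 -> Seq("a", "b", "a"), 2 -> Seq("a"), 2 -> Seq("a"))

  private def add(counts: Map[String, Long], word: String): Map[String, Long] =
    counts.updated(word, counts.getOrElse(word, 0L) + 1L)

  // Increment style: every delivery adds to the stored counter, so a replay double-counts.
  def withIncrements: Map[String, Long] =
    batches.foldLeft(Map.empty[String, Long]) { case (store, (_, words)) =>
      words.foldLeft(store)(add)
    }

  // Put-with-state style: the job keeps running totals as state and writes absolute values.
  def withPuts: Map[String, Long] = {
    var checkpoints = Map(0 -> Map.empty[String, Long]) // state as of the end of each batch
    var store = Map.empty[String, Long]
    for ((batchId, words) <- batches) {
      val before = checkpoints(batchId - 1)             // state restored for this batch
      val after = words.foldLeft(before)(add)
      checkpoints = checkpoints.updated(batchId, after)
      // Put the absolute totals for the keys touched in this batch.
      store = store ++ after.filter { case (word, _) => words.contains(word) }
    }
    store
  }
}
```

Replaying batch 2 recomputes the same totals from the same restored state, so the second put is a no-op, whereas the increment is applied again.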
41
Advanced Streaming
• Ad-hoc analysis will identify value
• Ad-hoc will become batch
• The value will demand less latency on batch
• Batch will become streaming
42
Advanced Streaming
• Requirements for ideal batch-to-streaming frameworks
– Something that can span both paradigms
– Something that can use the tools of ad-hoc analysis: SQL, MLlib, R, Scala, Java
– Development through a common IDE: debugging, unit testing
– Common deployment model
43
Spark Streaming Example
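// Batch word count with the core Spark API, for comparison with the streaming version
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)  // path: input location, defined elsewhere
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)  // bring the results to the driver and print them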
44
Spark Streaming Example
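// Streaming word count: the same transformations, applied to a DStream
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))  // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()      // prints the first elements of each micro-batch
ssc.start()             // start receiving and processing data
ssc.awaitTermination()  // keep the job running until it is stopped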
45
Advanced usage
46
Advanced Streaming
• In Spark Streaming
– A DStream is a sequence of RDDs, one per micro-batch interval
• If we can access RDDs in Spark Streaming
– We can convert to Vectors: KMeans, principal component analysis
– We can convert to LabeledPoint: NaiveBayes, random forest, linear support vector machines
– We can convert to DataFrames: SQL, R
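The idea that a stream is handled as one batch per interval can be illustrated without Spark. The sketch below is plain Scala and only an analogy for a DStream: `MicroBatchSketch`, `Event`, `microBatches` and the sample data are hypothetical. It buckets timestamped events by interval, so each bucket can then be treated with ordinary batch tools.

```scala
object MicroBatchSketch {
  final case class Event(timestampMs: Long, word: String)

  val sample: Seq[Event] =
    Seq(Event(100, "a"), Event(900, "b"), Event(1000, "c"), Event(2500, "d"))

  // Group events into consecutive intervals, keyed by the interval start time.
  def microBatches(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timestampMs / intervalMs * intervalMs)
      .toSeq
      .sortBy(_._1)
}
```

With a 1000 ms interval, the sample events fall into three batches starting at 0, 1000 and 2000 ms.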
47
Wrap-up
48
Our original goal: understand common use-cases for streaming and their architectures
49
Common streaming use-cases
• Ingestion
– Transformation
– Decision (e.g. Anomaly detection)
• Simple counts
– Lambda, etc.
• Advanced usage
– Machine Learning
– Windowing
50
Free books!
• Book signings
– Wednesday (today), 5:30 PM at O’Reilly booth
– Thursday (tomorrow), 3:15 PM at Cloudera booth
• Please leave us a review!
51
Stay in touch!
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska
@hadooparchbook
tiny.cloudera.com/streaming-singapore
tiny.cloudera.com/streaming-singapore-questions
hadooparchitecturebook.com

Architectural Patterns for Streaming Applications

  • 1. Architectural Patterns for Streaming applications Strata+Hadoop World, Singapore – December 02, 2015 tiny.cloudera.com/streaming-singapore tiny.cloudera.com/streaming-singapore-questions Mark Grover | @mark_grover Ted Malaska | @TedMalaska
  • 2. 2 About the book • @hadooparchbook • hadooparchitecturebook.com • github.com/hadooparchitecturebook • slideshare.com/hadooparchbook Questions? tiny.cloudera.com/streaming-singapore-questions
  • 3. 3 About the presenters • Principal Solutions Architect at Cloudera • Done Hadoop for 6 years – Worked with > 70 companies in 8 countries • Previously, lead architect at FINRA • Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark • Contributor to Apache Hadoop, HBase, Flume, Avro, Pig and Spark • Marvel fan boy, runner • Software Engineer at Cloudera, working on Spark • Committer on Apache Bigtop, PMC member on Apache Sentry (incubating) • Contributor to Apache Hadoop, Spark, Hive, Sqoop, Pig and Flume Questions? tiny.cloudera.com/streaming-singapore-questions Ted Malaska Mark Grover
  • 5. 5 Understand common use- cases for streaming and their architectures
  • 7. 7 When to stream, and when not to • We are looking for a SLA sweet spot • Multi milliseconds to seconds • Not minutes • Not constant low milliseconds or under • Doesn’t come for free Questions? tiny.cloudera.com/streaming-singapore-questions
  • 9. 9 Use-case categories • Ingestion – Transformation – Decision (e.g. Anomaly detection) • Simple counts – Lambda, etc. • Advanced usage – Machine Learning – Windowing Questions? tiny.cloudera.com/streaming-singapore-questions
  • 11. 11 What is ingestion? Questions? tiny.cloudera.com/streaming-singapore-questions IngestSource Systems Destination system
  • 12. 12 But there multiple sources Questions? tiny.cloudera.com/streaming-singapore-questions Source System 1 Destination systemSource System 2 Source System 3 Ingest
  • 13. 13 But.. • Sources, sinks, ingestion channels may go down • Sources and sinks may be producing/consuming at different rates • Regular maintenance windows may need to be scheduled • We need a resilient message broker Questions? tiny.cloudera.com/streaming-singapore-questions
  • 14. 14 Need for a message broker Questions? tiny.cloudera.com/streaming-singapore-questions Source System 1 Destination systemSource System 2 Source System 3 Ingest Extract Message broker
  • 15. 15 Kafka Questions? tiny.cloudera.com/streaming-singapore-questions Source System 1 Destination systemSource System 2 Source System 3 Ingest Extract Message broker
  • 16. 16 But ‘queue’ doesn’t ‘push’ Questions? tiny.cloudera.com/streaming-singapore-questions Source System 1 Storage systemSource System 2 Source System 3 Ingest Extract Streaming ingestion process Push Message broker
  • 17. 17 Streaming data ingestion process Questions? tiny.cloudera.com/streaming-singapore-questions Source System 1 Storage systemSource System 2 Source System 3 Ingest Extract Streaming ingestion process Push Kafka Connect Apache Flume Message broker
  • 18. 18 Streaming architecture for ingestion Questions? tiny.cloudera.com/streaming-singapore-questions Source System 1 Storage systemSource System 2 Source System 3 Ingest Extract Streaming ingestion process Push Kafka Connect Apache Flume Message broker
  • 20. 20 Streaming architecture for ingestion Questions? tiny.cloudera.com/streaming-singapore-questions Source System 1 Storage systemSource System 2 Source System 3 Ingest Extract Streaming ingestion process Push Kafka connect Apache Flume Message broker
  • 21. 21 Streaming architecture for ingestion Questions? tiny.cloudera.com/streaming-singapore-questions Source System 1 Storage systemSource System 2 Source System 3 Ingest Extract Streaming ingestion process Push Kafka connect Apache Flume Message broker Can be used to do simple transformations
  • 22. 22 Two types of transformations Atomic • Need to work with one event at a time • Example – mask a credit card number With context • Need to refer to external context • Example - convert zip code to state, by looking up a cache Questions? tiny.cloudera.com/streaming-singapore-questions
  • 23. 23 Atomic transformations • Require no context • Can be simply done within Flume interceptors, Kafka connect or Spark streaming Questions? tiny.cloudera.com/streaming-singapore-questions
  • 24. 24 Flume Interceptors • Mask fields • Validate information against external source • Extract fields • Modify data format • Filter or split events Questions? tiny.cloudera.com/streaming-singapore-questions
  • 25. 25 Streaming architecture for ingestion Questions? tiny.cloudera.com/streaming-singapore-questions Source System 1 Storage systemSource System 2 Source System 3 Ingest Extract Streaming ingestion process Push Kafka connect Apache Flume Message broker Can be used to do simple transformations
  • 26. 26 Transformations with context Questions? tiny.cloudera.com/streaming-singapore-questions
  • 27. 27 Exactly once, at least once, at most once (In the context of data ingestion)
  • 28. 28 Streaming architecture for ingestion Questions? tiny.cloudera.com/streaming-singapore-questions Source System 1 Storage systemSource System 2 Source System 3 Ingest Extract Streaming ingestion process Push Copycat Apache Flume Message broker Can be used to do simple transformations
  • 29. Semantic types
      • At most once
        – Not good for many cases
        – Only where performance/SLA is more important than accuracy
      • Exactly once
        – Expensive to achieve but desirable
      • At least once
        – Easiest to achieve
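The difference between at most once and at least once can be shown with a small simulation. This is a sketch under stated assumptions, not a real client: `FlakySink`, the `Outcome` cases and the per-attempt failure plan are hypothetical and exist only to make the behaviour deterministic.

```scala
import scala.collection.mutable.ArrayBuffer

object DeliverySketch {
  // What happens to a given write attempt (attempts are numbered from 1).
  sealed trait Outcome
  case object Ok extends Outcome       // stored and acknowledged
  case object Dropped extends Outcome  // not stored, no acknowledgement
  case object AckLost extends Outcome  // stored, but the acknowledgement is lost

  // Hypothetical sink whose failures follow a fixed plan; unplanned attempts succeed.
  final class FlakySink(plan: Map[Int, Outcome]) {
    val stored = ArrayBuffer.empty[String]
    private var attempt = 0
    // Returns true only when the sender sees an acknowledgement.
    def write(record: String): Boolean = {
      attempt += 1
      plan.getOrElse(attempt, Ok) match {
        case Ok      => stored += record; true
        case AckLost => stored += record; false
        case Dropped => false
      }
    }
  }

  // At most once: send each record a single time and never retry.
  def atMostOnce(plan: Map[Int, Outcome], records: Seq[String]): List[String] = {
    val sink = new FlakySink(plan)
    records.foreach(r => sink.write(r))
    sink.stored.toList
  }

  // At least once: resend each record until an acknowledgement arrives.
  def atLeastOnce(plan: Map[Int, Outcome], records: Seq[String]): List[String] = {
    val sink = new FlakySink(plan)
    records.foreach(r => while (!sink.write(r)) ())
    sink.stored.toList
  }
}
```

With the second attempt dropped, `atMostOnce` loses a record; with the second acknowledgement lost, `atLeastOnce` stores that record twice. Neither is exactly once on its own.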
  • 30. Categories of storage systems
      “Puts”-based
      • A record can be re-inserted without side effects, since the re-inserted record will have a duplicate key
      “Appends”-based
      • A record cannot be re-inserted without side effects: the re-inserted copy is stored as a duplicate
  • 31. How to achieve exactly once?
      • For “puts”-based storage systems
        – At least once is enough (keys have to be unique though, i.e. a primary key)
        – Re-inserted records will have duplicate keys
        – They will simply overwrite the existing record with the same value
      • For “appends”-based storage systems (e.g. HDFS)
        – Still easiest to do at least once
        – Need to de-duplicate before processing
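The effect of a replayed record on each kind of store can be sketched as follows. The in-memory `Map` and `Vector` are stand-ins for a key-value store and an append-only file, and the `Record` shape is an assumption made for the example.

```scala
object ExactlyOnceSketch {
  // A record with a unique key (hypothetical shape).
  final case class Record(key: String, value: String)

  // "Puts"-based store: writing the same key again overwrites in place.
  def putAll(records: Seq[Record]): Map[String, String] =
    records.foldLeft(Map.empty[String, String])((store, r) => store + (r.key -> r.value))

  // "Appends"-based store: every write is kept, replays included.
  def appendAll(records: Seq[Record]): Vector[Record] = records.toVector

  // De-duplicate appended data before processing, keeping the first copy of each key.
  def dedupe(log: Seq[Record]): Vector[Record] = {
    val seen = scala.collection.mutable.LinkedHashMap.empty[String, Record]
    log.foreach(r => if (!seen.contains(r.key)) seen(r.key) = r)
    seen.values.toVector
  }
}
```

With at-least-once delivery, the puts-based store ends up correct with no extra work, while the appends-based store needs the `dedupe` pass before its contents are processed.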
  • 33. [Architecture diagram labels: Hadoop Cluster I; Hadoop Cluster II; Flume (Source) with Interceptor (Rules); Flume Channel; Flume (Sink); HBase and/or Memory Store; Kafka; Events; Alerts/Events; NRT/Stream Processing; Spark Streaming; Adjusting NRT stats; Storage (HDFS, HBase); Batch Processing (Impala, Map/Reduce, Spark); Automated & Manual Analytical Adjustments and Pattern detection; Fetching & Updating Profiles/Rules; Batch Time Adjustments; Reporting]
  • 35. Streaming and Counting
      • Counting is easy, right?
      • Back to exactly once
  • 36. We started with Lambda
      [Diagram labels: Pipe; Speed Layer; Batch Layer; Persist Results; Speed Results; Batch Results; Serving Layer]
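In the usual description of the Lambda architecture, the serving layer answers queries by combining the batch results with the speed results. A minimal sketch for counts, assuming both views are simple key-to-count maps that cover disjoint time ranges (the object and function names are made up for the example):

```scala
object LambdaSketch {
  // Serving layer: add the speed layer's recent counts to the batch layer's counts.
  // Assumes the two views cover disjoint time ranges, so no event is counted in both.
  def serve(batch: Map[String, Long], speed: Map[String, Long]): Map[String, Long] =
    (batch.keySet ++ speed.keySet).map { k =>
      k -> (batch.getOrElse(k, 0L) + speed.getOrElse(k, 0L))
    }.toMap
}
```

The disjointness assumption is the fragile part: if the speed layer counts an event twice, or overlaps with the batch view, the merged total is wrong until the next batch run corrects it.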
  • 37. Why did Streaming Suck
      • Increments with Cassandra
        – Double increment
        – No strong consistency
      • Storm without Kafka
        – Not exactly once
        – Not at least once
      • Batch would have to re-process EVERY record to remove dups
  • 38. We have come a long way
      • We don’t have to use increments any more and we can have consistency
        – HBase
      • We can have state in our streaming platform
        – Spark Streaming
      • We don’t lose data
        – Spark Streaming
        – Kafka
        – Other options
      • Full universe of de-duping
        – Again, HBase with versions
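The double-increment problem, and one way of avoiding increments, can be sketched in plain Scala. This is an illustration of the general idea rather than the presenters' implementation: the `Hit` shape and its unique `id` field are assumptions, and the in-memory maps stand in for a real store.

```scala
object CountingSketch {
  // One delivery of a counted event; `id` is a hypothetical unique event id.
  final case class Hit(id: String, page: String)

  // Increment-based counting: each delivery adds one, so a replay is counted twice.
  def countByIncrement(deliveries: Seq[Hit]): Map[String, Long] =
    deliveries.foldLeft(Map.empty[String, Long]) { (counts, h) =>
      counts + (h.page -> (counts.getOrElse(h.page, 0L) + 1L))
    }

  // Put-based counting: store each event under its id (a replay overwrites itself),
  // then derive the counts from the stored events.
  def countByPut(deliveries: Seq[Hit]): Map[String, Long] = {
    val stored = deliveries.map(h => h.id -> h.page).toMap
    stored.values.groupBy(identity).map { case (page, hits) => page -> hits.size.toLong }
  }
}
```

When the same event is delivered twice, `countByIncrement` over-counts while `countByPut` does not, because writing the same key again is idempotent.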
  • 41. Advanced Streaming
      • Ad-hoc work will identify value
      • Ad-hoc will become batch
      • The value will demand less latency on batch
      • Batch will become streaming
  • 42. Advanced Streaming
      • Requirements for ideal batch-to-streaming frameworks
        – Something that can span both paradigms
        – Something that can use the tools of ad-hoc: SQL, MLlib, R, Scala, Java
        – Development through a common IDE: debugging, unit testing
        – Common deployment model
  • 43. Spark Streaming Example
      // Batch word count with SparkContext
      import org.apache.spark.{SparkConf, SparkContext}

      val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
      val sc = new SparkContext(conf)
      val lines = sc.textFile(path, 2)  // path: input location, defined elsewhere
      val words = lines.flatMap(_.split(" "))
      val pairs = words.map(word => (word, 1))
      val wordCounts = pairs.reduceByKey(_ + _)
      wordCounts.collect().foreach(println)  // an RDD has no print(); collect to the driver first
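  • 44. Spark Streaming Example
      // Streaming word count with StreamingContext and 1-second micro-batches
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
      val ssc = new StreamingContext(conf, Seconds(1))
      val lines = ssc.socketTextStream("localhost", 9999)
      val words = lines.flatMap(_.split(" "))
      val pairs = words.map(word => (word, 1))
      val wordCounts = pairs.reduceByKey(_ + _)
      wordCounts.print()      // prints the first elements of each batch
      ssc.start()             // start receiving and processing
      ssc.awaitTermination()  // block until the job is stopped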
  • 46. Advanced Streaming
      • In Spark Streaming
        – A DStream is a sequence of RDDs, one per micro-batch interval
      • If we can access RDDs in Spark Streaming
        – We can convert to Vectors: KMeans, principal component analysis
        – We can convert to LabeledPoint: NaiveBayes, random forest, linear support vector machines
        – We can convert to DataFrames: SQL, R
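The micro-batch idea can be illustrated without Spark. The sketch below is plain Scala, not the Spark Streaming API: it groups timestamped events into fixed intervals, with each inner `Vector` playing the role of one RDD, and then processes every batch independently. The `Event` shape and function names are assumptions, and unlike a real DStream it skips intervals that contain no events.

```scala
object MicroBatchSketch {
  // Hypothetical timestamped event.
  final case class Event(timestampMs: Long, value: Double)

  // Group events into consecutive intervals of `batchMs`, ordered by time.
  // Each inner Vector plays the role of one RDD in the DStream.
  def microBatches(events: Seq[Event], batchMs: Long): Vector[Vector[Event]] =
    events
      .groupBy(_.timestampMs / batchMs)
      .toVector
      .sortBy(_._1)
      .map { case (_, batch) => batch.toVector }

  // Per-batch processing, analogous to working on each RDD: here, a mean.
  def batchMeans(batches: Seq[Vector[Event]]): Seq[Double] =
    batches.map(b => b.map(_.value).sum / b.size)
}
```

Because each batch is an ordinary collection, anything that works on a batch data set can in principle be applied per interval, which is the point the slide makes about reusing Vectors, LabeledPoint and DataFrames.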
  • 48. Our original goal: understand common use-cases for streaming and their architectures
  • 49. Common streaming use-cases
      • Ingestion
        – Transformation
        – Decision (e.g. anomaly detection)
      • Simple counts
        – Lambda, etc.
      • Advanced usage
        – Machine learning
        – Windowing
  • 50. Free books!
      • Book signings
        – Wednesday (today), 5:30 PM at O’Reilly booth
        – Thursday (tomorrow), 3:15 PM at Cloudera booth
      • Please leave us a review!
  • 51. Stay in touch!
      Mark Grover | @mark_grover
      Ted Malaska | @TedMalaska
      @hadooparchbook
      tiny.cloudera.com/streaming-singapore
      tiny.cloudera.com/streaming-singapore-questions
      hadooparchitecturebook.com