SlideShare uma empresa Scribd logo
1 de 49
Lambda Architecture
with Apache Spark
IMAGE
About Me
https://ua.linkedin.com/in/tarasmatyashovsky
Apache Hadoop: A Brief History
http://www.slideshare.net/fadicce/hadoop-user-group-uae-meeting
A lot of customers implemented
successful Hadoop-based M/R pipelines
which are operating today
Examples from Real Life
• Oozie workflow, operates daily and processes up to
150 TB to generate analytics
• bash managed workflow, operates daily and processes
up to 8 TB to generate analytics
It Is 2016 Now!
• Making decisions faster is more valuable
• Kafka, Storm, Trident, Samza, Spark, Flink, Parquet,
Avro, Cloud providers, etc.
Examples from Real Life
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
Lambda Architecture
A data-processing architecture
designed to handle massive quantities of data
by taking advantage of both
batch and stream processing methods
http://lambda-architecture.net/
https://www.manning.com/books/big-data
https://www.manning.com/books/big-data
Layers of Lambda Architecture
Batch layer
• manages the master dataset (an immutable, append-only set of
raw data)
• pre-computes the batch views
Serving layer
• indexes the batch views so that they can be queried in ad-hoc with
low-latency
Speed layer
• deals with recent data only
http://lambda-architecture.net/
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
Relevance of Data
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
query =
real time view =
batch view =
function(batch view, real time view)
function(real time view, new data)
function(all data)
Trade-offs
Full recomputation vs. partial recomputation
e.g. using Bloom filters
Recomputational algorithms vs. incremental algorithms
Additive algorithms vs. approximation algorithms
e.g. HyperLogLog for count-distinct problem
Implementation of Lambda Architecture
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
Can be considered as an integrated solution
for processing on all lambda architecture layers
Apache Spark: a Brief History
Why Apache Spark?
As of mid 2014,
Spark is the most active Big Data project
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Total Contributors
Core Concepts
automatically distribute data across cluster
and
parallelize operations performed on them
Components Stack
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Enables scalable, high-throughput, fault-tolerant
stream processing of live data streams
50% users consider most important part of Spark
Spark Streaming
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Streaming Architecture
• micro-batch architecture
• series of batch computations on small chunks of data
• batch interval is configurable
• exactly once semantics
http://spark.apache.org/docs/latest/streaming-programming-guide.html
Streaming Architecture
http://spark.apache.org/docs/latest/streaming-programming-guide.html
https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
http://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
DStream as a Continuous Series of RDDs
http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
http://spark.apache.org/docs/latest/streaming-programming-guide.html#discretized-streams-dstreams
Provide hashtags statistics
used in a #morningatlohika tweets
All time till today + right now
Sample Application
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Batch View
apache –
architecture –
aws –
java –
jeeconf –
lambda –
morningatlohika –
simpleworkflow –
spark –
6
12
3
4
7
6
15
14
5
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Real-time View
“Cool presentation by @tmatyashovsky about
#lambda #architecture using #apache #spark
at #morningatlohika”
apache –
architecture –
morningatlohika –
lambda –
spark –
1
1
1
1
1
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Batch View + Real-time View
apache –
architecture –
aws –
java –
jeeconf –
lambda –
morningatlohika –
simpleworkflow –
spark –
7
13
3
4
7
7
16
14
6
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Simplified Steps
• Create batch view (.parquet) via Apache Spark
• Cache batch view in Apache Spark
• Start streaming application connected to Twitter
• Focus on real-time #morningatlohika tweets*
• Build incremental real-time views
• Query, i.e. merge batch and real-time views on a fly
* Stream from file system (used for testing) can be used as a backup
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
Demo Time
https://github.com/tmatyashovsky/lambda-architecture-jeeconf-kyiv
http://shop.oreilly.com/product/0636920028512.do
http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
Structured Streaming in Spark 2.0
The simplest way to perform streaming analytics
is not having to reason about streaming
Static DataFrame API = Infinite DataFrame API
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
Structured Streaming
• Introduces streaming API built on top of Spark SQL
• Unifies streaming, interactive and batch queries
logs = context.read.format("json")
.stream("s3://logs")
logs.groupBy(logs.user_id)
.agg(sum(logs.time))
.write.format("jdbc")
.stream("jdbc:mysql//...")
https://www.youtube.com/watch?v=oXkxXDG0gNk
Instead of Epilogue
http://milinda.pathirage.org/kappa-architecture.com/
http://milinda.pathirage.org/kappa-architecture.com/
Taras Matyashovsky
taras.matyashovsky@gmail.com
@tmatyashovsky
http://www.filevych.com/
Thank you!
References
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark
https://www.manning.com/books/big-data
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly
Media)
http://spark.apache.org/docs/latest/streaming-programming-guide.html
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
http://www.rittmanmead.com/2015/08/combining-spark-streaming-and-data-frames-for-near-real-time-log-analysis/
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Streaming%20mapWithState.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://milinda.pathirage.org/kappa-architecture.com/
http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark
http://thenewstack.io/spark-2-0-will-offer-interactive-querying-live-data/
http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617
https://databricks.com/blog/2015/10/13/interactive-audience-analytics-with-spark-and-hyperloglog.html
https://www.youtube.com/watch?v=ZFBgY0PwUeY
https://www.youtube.com/watch?v=oXkxXDG0gN
http://milinda.pathirage.org/kappa-architecture.com/

Mais conteúdo relacionado

Mais de Taras Matyashovsky

Influence. The Psychology of Persuasion (in IT)
Influence. The Psychology of Persuasion (in IT)Influence. The Psychology of Persuasion (in IT)
Influence. The Psychology of Persuasion (in IT)Taras Matyashovsky
 
JEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkJEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkTaras Matyashovsky
 
Morning at Lohika 1st anniversary
Morning at Lohika 1st anniversaryMorning at Lohika 1st anniversary
Morning at Lohika 1st anniversaryTaras Matyashovsky
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkTaras Matyashovsky
 
New life inside monolithic application
New life inside monolithic applicationNew life inside monolithic application
New life inside monolithic applicationTaras Matyashovsky
 
Distributed applications using Hazelcast
Distributed applications using HazelcastDistributed applications using Hazelcast
Distributed applications using HazelcastTaras Matyashovsky
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.Taras Matyashovsky
 

Mais de Taras Matyashovsky (9)

Confession of an Engineer
Confession of an EngineerConfession of an Engineer
Confession of an Engineer
 
Influence. The Psychology of Persuasion (in IT)
Influence. The Psychology of Persuasion (in IT)Influence. The Psychology of Persuasion (in IT)
Influence. The Psychology of Persuasion (in IT)
 
JEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache SparkJEEConf 2015 - Introduction to real-time big data with Apache Spark
JEEConf 2015 - Introduction to real-time big data with Apache Spark
 
Morning at Lohika 1st anniversary
Morning at Lohika 1st anniversaryMorning at Lohika 1st anniversary
Morning at Lohika 1st anniversary
 
Introduction to real time big data with Apache Spark
Introduction to real time big data with Apache SparkIntroduction to real time big data with Apache Spark
Introduction to real time big data with Apache Spark
 
New life inside monolithic application
New life inside monolithic applicationNew life inside monolithic application
New life inside monolithic application
 
Distributed applications using Hazelcast
Distributed applications using HazelcastDistributed applications using Hazelcast
Distributed applications using Hazelcast
 
Morning at Lohika
Morning at LohikaMorning at Lohika
Morning at Lohika
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 

Último

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 

Último (20)

Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Lambda Architecture with Apache Spark

Notas do Editor

  1. Receiver: Task that collects data from the input source and represents it as RDDs Is launched automatically for each input source Replicates data to another executor for fault tolerance
  2. Cluster Manager: Standalone, Apache Mesos, Hadoop Yarn Cluster Manager should be chosen and configured properly Monitoring via web UI(s) and metrics Web UI: master web UI worker web UI driver web UI - available only during execution history server - spark.eventLog.enabled = true Metrics based on Coda Hale Metrics library. Can be reported via HTTP, JMX, and CSV files.
  3. Spark 2.0: Project Tungsten 2.0 Whole stage code generation Optimized input / output -> Parquet + built-in cache Spark Streaming DataFrame API unified with Dataset API