Spark Streaming Tips for Devs & Ops
WHO ARE WE?
Fede Fernández
Scala Software Engineer at 47 Degrees
Spark Certified Developer
@fede_fdz
Fran Pérez
Scala Software Engineer at 47 Degrees
Spark Certified Developer
@FPerezP
Overview
Spark Streaming
Spark + Kafka
groupByKey vs reduceByKey
Table Joins
Serializer
Tuning
Spark Streaming
Real-time data processing
[Diagram: a continuous data flow is divided into a sequence of RDDs; the sequence forms a DStream, which is processed into output data.]
Spark + Kafka
● Receiver-based Approach
○ At least once (with Write Ahead Logs)
● Direct API
○ Exactly once (if the output operation is idempotent or transactional)
Spark + Kafka
● Receiver-based Approach
Spark + Kafka
● Direct API
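As a minimal sketch of the Direct API, assuming the Spark 1.x `spark-streaming-kafka` (Kafka 0.8) artifact is on the classpath. The broker address, topic name, application name and batch interval are hypothetical placeholders, not values from the talk:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-kafka-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical broker list and topic; replace with real values.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val topics = Set("events")

    // Direct API: no receiver is started; Spark reads offset ranges from Kafka itself.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Print the number of messages received in each batch.
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The receiver-based approach would use `KafkaUtils.createStream` instead, which is where Write Ahead Logs come into play.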
groupByKey VS reduceByKey
● groupByKey
○ Groups pairs of data with the same key.
● reduceByKey
○ Groups pairs of data with the same key and combines their values with a reduce operation.
groupByKey VS reduceByKey
// Word count with groupByKey
sc.textFile("hdfs://….")
  .flatMap(_.split(" "))
  .map((_, 1)).groupByKey.map(t => (t._1, t._2.sum))

// Word count with reduceByKey
sc.textFile("hdfs://….")
  .flatMap(_.split(" "))
  .map((_, 1)).reduceByKey(_ + _)
groupByKey
[Diagram, built up over several slides: the input partitions hold 15 (key, 1) pairs with keys c, s and j. All 15 pairs are shuffled unchanged, gathered by key, and only then summed into (j, 5), (s, 6) and (c, 4).]
reduceByKey
[Diagram, built up over several slides: 15 (key, 1) pairs are first combined inside each partition; for example (s, 1), (j, 1), (j, 1), (c, 1), (s, 1) becomes (j, 2), (c, 1), (s, 2). Only these partial results are shuffled: (j, 2), (j, 1), (j, 1), (j, 1) gives (j, 5); (s, 2), (s, 2), (s, 1), (s, 1) gives (s, 6); (c, 1), (c, 2), (c, 1) gives (c, 4).]
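The difference in shuffle volume can be reproduced with a plain-Scala sketch that does not use Spark at all. The four-partition layout is an assumption reconstructed from the reduceByKey diagram, and `ShuffleSketch` and its method names are hypothetical:

```scala
object ShuffleSketch {
  type Pair = (String, Int)

  // Assumed partition layout, reconstructed from the reduceByKey diagram.
  val partitions: Seq[Seq[Pair]] = Seq(
    Seq("s" -> 1, "j" -> 1, "j" -> 1, "c" -> 1, "s" -> 1),
    Seq("j" -> 1, "s" -> 1, "s" -> 1),
    Seq("s" -> 1, "c" -> 1, "c" -> 1, "j" -> 1),
    Seq("c" -> 1, "s" -> 1, "j" -> 1)
  )

  // Sums the values of each key.
  def combine(pairs: Seq[Pair]): Seq[Pair] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }.toSeq

  // groupByKey model: every pair crosses the network unchanged.
  def groupByKeyShuffle(parts: Seq[Seq[Pair]]): Seq[Pair] = parts.flatten

  // reduceByKey model: each partition is combined locally before the shuffle.
  def reduceByKeyShuffle(parts: Seq[Seq[Pair]]): Seq[Pair] = parts.flatMap(combine)

  def main(args: Array[String]): Unit = {
    val grouped = groupByKeyShuffle(partitions)
    val reduced = reduceByKeyShuffle(partitions)
    println(s"groupByKey shuffles ${grouped.size} records")   // 15
    println(s"reduceByKey shuffles ${reduced.size} records")  // 11
    // Both approaches end with the same totals per key.
    println(combine(grouped).toMap == combine(reduced).toMap)
  }
}
```

With only three distinct keys the saving is modest (11 records instead of 15); it grows with the number of repeated keys per partition.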
reduce VS group
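● reduceByKey improves performance: less data is shuffled
● It can’t always be used: the reduce function must return the same type as the values
● groupByKey can cause Out of Memory exceptions: all values of a key must fit in memory
● Alternatives: aggregateByKey, foldByKey, combineByKey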
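To illustrate why the last three operations help when the result type differs from the value type, here is a plain-Scala model of the `combineByKey` contract (create a combiner, merge a value into it, merge two combiners). It is a sketch of the semantics only, not the Spark API; `CombineByKeySketch`, `averagePerKey` and the sample data are hypothetical:

```scala
object CombineByKeySketch {
  // Plain-Scala model of combineByKey over a list of partitions.
  def combineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Map side: fold each partition into one combiner per key.
    val partials: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
      }
    }
    // Reduce side: merge the per-partition combiners that share a key.
    partials.flatten.foldLeft(Map.empty[K, C]) { case (acc, (k, c)) =>
      acc.updated(k, acc.get(k).map(mergeCombiners(_, c)).getOrElse(c))
    }
  }

  // Average per key: the combiner is a (sum, count) pair, not a Double.
  def averagePerKey(partitions: Seq[Seq[(String, Double)]]): Map[String, Double] =
    combineByKey[String, Double, (Double, Int)](
      partitions,
      v => (v, 1),
      (c, v) => (c._1 + v, c._2 + 1),
      (x, y) => (x._1 + y._1, x._2 + y._2)
    ).map { case (k, (sum, n)) => k -> sum / n }

  def main(args: Array[String]): Unit = {
    val data = Seq(Seq("a" -> 10.0, "b" -> 4.0, "a" -> 20.0), Seq("a" -> 30.0, "b" -> 6.0))
    println(averagePerKey(data)) // a averages to 20.0, b to 5.0
  }
}
```

An average cannot be expressed with `reduceByKey` alone because averaging is not associative, whereas merging (sum, count) pairs is.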
Table Joins
● Typical operations that can be improved
● Need prior analysis
● There are no silver bullets
Table Joins: Medium - Large
Table Joins: Medium - Large
FILTER
No Shuffle
Table Joins: Small - Large
...
Shuffled Hash Join
sqlContext.sql("explain <select>").collect.mkString("\n")
[== Physical Plan ==]
[Project]
[+- SortMergeJoin]
[   :- Sort]
[   :  +- TungstenExchange hashpartitioning]
[   :     +- TungstenExchange RoundRobinPartitioning]
[   :        +- ConvertToUnsafe]
[   :           +- Scan ExistingRDD]
[   +- Sort]
[      +- TungstenExchange hashpartitioning]
[         +- ConvertToUnsafe]
[            +- Scan ExistingRDD]
Table Joins: Small - Large
Broadcast Hash Join
sqlContext.sql("explain <select>").collect.mkString("\n")
[== Physical Plan ==]
[Project]
[+- BroadcastHashJoin]
[   :- TungstenExchange RoundRobinPartitioning]
[   :  +- ConvertToUnsafe]
[   :     +- Scan ExistingRDD]
[   +- Scan ParquetRelation]
No shuffle!
By default from Spark 1.4 when using the DataFrame API.
Prior to Spark 1.4:
ANALYZE TABLE small_table COMPUTE STATISTICS noscan
Broadcast
Table Joins: Small - Large
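The reason a broadcast hash join needs no shuffle can be sketched in plain Scala: the small table becomes a hash map that is probed partition by partition, so rows of the large table never move. This is a model of the idea, not Spark's implementation; `BroadcastJoinSketch` and the sample tables are hypothetical:

```scala
object BroadcastJoinSketch {
  // `small` plays the broadcast table; each inner Seq of `large` is one partition.
  def broadcastHashJoin[K, A, B](
      large: Seq[Seq[(K, A)]],
      small: Seq[(K, B)]): Seq[Seq[(K, (A, B))]] = {
    // Build the hash table once; Spark would ship a copy to every executor.
    val lookup: Map[K, Seq[B]] =
      small.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
    // Probe it inside each partition; the partitioning of `large` is preserved.
    large.map { partition =>
      for {
        (k, a) <- partition
        b <- lookup.getOrElse(k, Seq.empty)
      } yield (k, (a, b))
    }
  }

  def main(args: Array[String]): Unit = {
    val orders = Seq(Seq(1 -> "o1", 2 -> "o2"), Seq(1 -> "o3", 3 -> "o4"))
    val countries = Seq(1 -> "ES", 2 -> "US")
    broadcastHashJoin(orders, countries).foreach(println)
  }
}
```

In the DataFrame API the same effect can be requested explicitly with the `broadcast` function from `org.apache.spark.sql.functions` (available since Spark 1.5).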
Serializers
● Java’s ObjectOutputStream framework (default).
● Custom serializers: extend Serializable or Externalizable.
● KryoSerializer: register your custom classes.
● Where is our code being run?
● Take special care with JodaTime.
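Switching to Kryo and registering classes is a configuration step. A minimal sketch, assuming `spark-core` is on the classpath; the `Event` class and the application name are hypothetical:

```scala
import org.apache.spark.SparkConf

// Hypothetical domain class, used only for illustration.
case class Event(id: Long, payload: String)

object KryoConfSketch {
  val conf: SparkConf = new SparkConf()
    .setAppName("kryo-sketch")
    // Replace the default Java serializer with Kryo.
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Registered classes are written as a small id instead of the full class name.
    .registerKryoClasses(Array[Class[_]](classOf[Event]))
}
```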
Tuning
Garbage Collector
blockInterval
Partitioning
Storage
Tuning: Garbage Collector
• Applications which rely heavily on memory consumption.
• GC Strategies
• Concurrent Mark Sweep (CMS) GC
• ParallelOld GC
• Garbage-First GC
• Tuning steps:
• Review your logic and object management
• Try Garbage-First
• Activate and inspect the logs
Reference: https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
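The last two tuning steps can be expressed as executor JVM options. A minimal sketch, assuming `spark-core` on the classpath and a Java 7/8 runtime (the logging flags were replaced by `-Xlog:gc*` in later JVMs); the application name is hypothetical:

```scala
import org.apache.spark.SparkConf

object GcConfSketch {
  // Use Garbage-First and print GC activity to the executor logs.
  val gcOptions: String =
    "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

  val conf: SparkConf = new SparkConf()
    .setAppName("gc-sketch")
    .set("spark.executor.extraJavaOptions", gcOptions)
}
```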
Tuning: blockInterval
blockInterval = (bi * consumers) / (pf * sc)
● CAT: Total cores per partition.
● bi: Batch Interval time in milliseconds.
● consumers: number of streaming consumers.
● pf (partitionFactor): number of partitions per core.
● sc (sparkCores): CAT - consumers.
blockInterval: example
● batchIntervalMillis = 600,000
● consumers = 20
● CAT = 120
● sparkCores = 120 - 20 = 100
● partitionFactor = 3
blockInterval = (bi * consumers) / (pf * sc) =
(600,000 * 20) / (3 * 100) =
40,000
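The same arithmetic as a small plain-Scala helper, which makes it easy to try other cluster sizes. `BlockIntervalSketch` and its parameter names are hypothetical; the formula is the one from the slide:

```scala
object BlockIntervalSketch {
  // blockInterval = (bi * consumers) / (pf * sc), with sc = CAT - consumers.
  def blockIntervalMillis(
      batchIntervalMillis: Long,
      consumers: Int,
      totalCores: Int,
      partitionFactor: Int): Long = {
    val sparkCores = totalCores - consumers
    require(sparkCores > 0, "at least one core must be left for processing")
    // Integer division: the result is truncated to whole milliseconds.
    (batchIntervalMillis * consumers) / (partitionFactor.toLong * sparkCores)
  }

  def main(args: Array[String]): Unit =
    println(blockIntervalMillis(600000L, 20, 120, 3)) // 40000
}
```

The resulting value would typically be passed to Spark through the `spark.streaming.blockInterval` setting.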
Tuning: Partitioning
partitions = consumers * bi / blockInterval
● consumers: number of streaming consumers.
● bi: Batch Interval time in milliseconds.
● blockInterval: interval, in milliseconds, at which received data is split into blocks before being stored in Spark.
Partitioning: example
● batchIntervalMillis = 600,000
● consumers = 20
● blockInterval = 40,000
partitions = consumers * bi / blockInterval =
20 * 600,000 / 40,000 =
300
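A plain-Scala check of this formula, under the same hypothetical naming as a sketch rather than any Spark API. For the values of the example the result equals partitionFactor * sparkCores = 3 * 100, which is what the blockInterval formula was designed to achieve:

```scala
object PartitionsSketch {
  // partitions = consumers * bi / blockInterval
  def partitions(consumers: Int, batchIntervalMillis: Long, blockIntervalMillis: Long): Long = {
    require(blockIntervalMillis > 0, "blockInterval must be positive")
    consumers * batchIntervalMillis / blockIntervalMillis
  }

  def main(args: Array[String]): Unit =
    println(partitions(20, 600000L, 40000L)) // 300
}
```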
Tuning: Storage
• Default (MEMORY_ONLY)
• MEMORY_ONLY_SER with Serialization Library
• MEMORY_AND_DISK & DISK_ONLY
• Replicated variants (suffix _2)
• OFF_HEAP (Tachyon/Alluxio)
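Choosing a storage level is a one-line change on the RDD or DStream. A minimal local-mode sketch, assuming `spark-core` on the classpath; the application name and the sample data are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("storage-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val numbers = sc.parallelize(1 to 1000)
    // Keep the RDD in memory as serialized bytes: smaller footprint, extra CPU to read.
    numbers.persist(StorageLevel.MEMORY_ONLY_SER)

    // The first action materializes the cache; later actions reuse it.
    println(numbers.count())
    println(numbers.map(_ * 2).count())

    sc.stop()
  }
}
```

Serialized levels such as MEMORY_ONLY_SER benefit most from a compact serializer like Kryo.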
Where to find more information?
Spark Official Documentation
Databricks Blog
Databricks Spark Knowledge Base
Spark Notebook - By Andy Petrella
Databricks YouTube Channel
QUESTIONS
Fede Fernández
@fede_fdz
fede.f@47deg.com
Fran Pérez
@FPerezP
fran.p@47deg.com
Thanks!
