Apache Spark Workshop
09/2017, Stockholm
About Me
● Software Engineer from 2002
● Devops -> Dev Tooling -> Backend -> Big Data
● Experience with Spark
○ Tens of Spark daily jobs
○ Processing tens of terabytes daily
○ 150K events/sec using Spark streaming with Kafka
● Open Source
○ https://github.com/viyadb
Introduction to Spark
Spark Ecosystem
● Extensions: SQL, Streaming, MLlib, GraphX, R
● Core Components
● Resource Managers: YARN, Standalone
● Distributed Storage
Spark Application
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line =>
line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
● Supported languages (by popularity)
○ Scala
○ Python
○ Java
○ R
Running Applications
● Distribution
○ Uberjar that includes all the dependencies not available in Spark
● Submitting
○ spark-submit --conf spark.key=value … --class … application.jar
RDD
● Abstraction of distributed data-set
● Scala array-like data structure
○ RDD[String]
○ RDD[(String, Int)]
○ RDD[User]
○ …
RDD - Operations
[Diagram: data is read from the external world into RDD partitions, transformed partition by partition, redistributed across partitions by a shuffle, and written back to the external world by an action]
RDD - Operations
● Read
○ Database, Message Queue, File Storage, Socket, etc.
● Transform
○ map, flatMap, mapPartitions, filter, etc.
● Shuffle
○ reduceByKey, groupByKey, etc.
● Action
○ show, collect, count, etc.
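For illustration, a minimal sketch tying the four operation types together (the input path is hypothetical):
val lines = sc.textFile("hdfs:///logs/events.txt") // read
val pairs = lines.map(line => (line.split(" ")(0), 1)) // transform
val counts = pairs.reduceByKey(_ + _) // shuffle
counts.collect() // action: triggers execution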
RDD - Operations
● Are lazy: executed only when an action is called
● Re-executed whenever an action is called (see the sketch below)
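A quick illustration of both points, reusing the data.txt example:
val words = sc.textFile("data.txt").flatMap(_.split(" ")) // nothing runs yet
words.count() // first action: the file is read and transformed now
words.count() // second action: read and transformed again, unless cached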
RDD - Creating
// From Scala objects
val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)
// From files
val rdd = sc.textFile("data.txt")
RDD - Custom
class MyRDD(...) extends RDD[T](...) {
// method for calculating the data for given partition
override def compute(s: Partition, c: TaskContext): Iterator[T] = {
}
// method for calculating partitions
override protected def getPartitions: Array[Partition] = {
}
}
Dataset
● Structured hierarchical data with schema
● Supports RDD and SQL operations
● Dataframe is an alias of Dataset[Row]
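A minimal sketch showing both styles on a typed Dataset, assuming people.json matches the Person case class:
case class Person(name: String, age: Long)
import spark.implicits._
val people = spark.read.json("people.json").as[Person] // Dataset[Person]
val adults = people.filter(_.age > 21) // typed, RDD-style operation
val byAge = people.groupBy("age").count() // untyped, SQL-style operation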
Creating Dataset - Using RDD
// from RDD of objects using reflection
val rdd: RDD[Person] = …
val df = rdd.toDF()
// from RDD, providing a schema explicitly (rows must be RDD[Row])
val rowRdd = sc.textFile("people.txt")
.map(_.split(","))
.map(fields => Row(fields(0), fields(1).trim))
val schema = StructType(Seq(
StructField("name", StringType, nullable = true), ...))
val df = spark.createDataFrame(rowRdd, schema)
Creating Dataset - Using Source
// from JSON file
val df = spark.read.json("people.json")
// from Parquet files
val df = spark.read.parquet("people/*.parquet")
Querying Dataframe - DSL
val df = spark.read.json("people.json")
val otherDf = df.select($"name", $"age")
.filter($"name" === "John" && $"age" > 21)
.groupBy("age")
.count()
Querying Dataframe - SQL
val df = spark.read.json("people.json")
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("""
SELECT age, count(*) FROM people
WHERE
name = 'John' AND age > 21
GROUP BY
age
""")
Some Concepts
Caching
val df = transformData(loadData())
val report1 = calculateReport1(df)
report1.write.parquet(s"s3://mycomp-reports/report1/dt=${dt}")
val report2 = calculateReport2(df)
report2.write.parquet(s"s3://mycomp-reports/report2/dt=${dt}")
// Problem: data will be loaded and transformed twice!
Caching
val df = transformData(loadData())
// cache calculated data using default storage level
df.cache()
val report1 = calculateReport1(df)
report1.write.parquet(s"s3://mycomp-reports/report1/dt=${dt}")
val report2 = calculateReport2(df)
report2.write.parquet(s"s3://mycomp-reports/report2/dt=${dt}")
Caching
val df = transformData(loadData())
// cache calculated data using specific storage level
df.persist(StorageLevel.MEMORY_AND_DISK)
val report1 = calculateReport1(df)
report1.write.parquet(s"s3://mycomp-reports/report1/dt=${dt}")
val report2 = calculateReport2(df)
report2.write.parquet(s"s3://mycomp-reports/report2/dt=${dt}")
Checkpointing
[Diagram: rdd.checkpoint() writes the RDD contents to storage and truncates its lineage]
● Stores RDD contents to disk
● A way to break the lineage
● Dataset.checkpoint() is available starting from Spark 2.1
● In memory: RDD.localCheckpoint()
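A minimal usage sketch (the checkpoint directory is hypothetical):
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")
val rdd = sc.textFile("data.txt").map(line => line.toUpperCase)
rdd.checkpoint() // marks the RDD; contents are written on the next action
rdd.count() // the action triggers both the computation and the checkpoint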
Broadcasting
[Diagram: sc.broadcast(val) makes a value available on all executors; it is shipped once to each executor and shared by the tasks running there]
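A minimal usage sketch with an illustrative lookup table:
val countryNames = Map("SE" -> "Sweden", "IL" -> "Israel") // small value to share
val lookup = sc.broadcast(countryNames) // shipped once per executor
val codes = sc.parallelize(Seq("SE", "IL", "SE"))
val resolved = codes.map(c => lookup.value.getOrElse(c, "unknown"))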
Internals
Spark Components
● Spark Driver: runs the application code (e.g. the word-count example above) and coordinates the job
● Cluster Manager: allocates resources for the application
● Worker Nodes: each runs an Executor process that executes Tasks
Job Scheduling
RDD Objects → DAG Scheduler → Task Scheduler → Executor
● RDD Objects: build the DAG of operators
● DAG Scheduler (DAG → TaskSet): splits the DAG into stages of tasks, submits each stage as ready, agnostic to operators
● Task Scheduler (TaskSet → Task): launches tasks on the cluster, retries failed tasks, agnostic to stages
● Executor: executes tasks, stores and serves blocks via the block manager
SQL Query Optimizer
Memory Management
Memory Management in YARN (EMR)
● Node memory is divided into:
○ OS overhead
○ Hadoop components (HDFS Data Node, etc.)
○ YARN containers (#1, #2, ...), each consisting of:
○ YARN container memory overhead (includes Spark off-heap memory)
○ Spark executor memory, split into the shuffle memory fraction and the storage memory fraction
YARN Configuration
● Rule of thumb
○ 5 cores per executor
● Automatic configuration in EMR
○ maximizeResourceAllocation = true
Resource Allocation - Static
--executor-memory 1G --num-executors 10
[Chart: with static allocation, allocated memory/CPU stays constant over time while actual usage varies below it]
Resource Allocation - Dynamic
spark.dynamicAllocation.enabled=true
[Chart: with dynamic allocation, allocated memory/CPU follows actual usage over time]
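The related settings can also be set programmatically; a sketch with illustrative values:
import org.apache.spark.SparkConf
val conf = new SparkConf()
.set("spark.dynamicAllocation.enabled", "true")
.set("spark.dynamicAllocation.minExecutors", "2")
.set("spark.dynamicAllocation.maxExecutors", "20")
.set("spark.shuffle.service.enabled", "true") // keeps shuffle data available while executors come and go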
Spark Streaming
Method of Operation
Input stream → micro-batches → Spark engine → processed data
Receiver Based Approach
● A Receiver runs inside an executor and continuously receives data
● Received data is stored as blocks and replicated to another executor
● The Spark Driver is notified about received blocks
● Every batch interval, the driver launches tasks to process the blocks
Receiver Based Approach Sample
● https://github.com/spektom/spark-workshop/blob/master/spark-streams/src/main/scala/com/github/spektom/spark/streams/ReceiverStreamJob.scala
Direct Approach
● The Spark Driver schedules the next micro-batch job
● The Executor computes the RDD directly from the source (no receiver)
Direct Approach Sample
● https://github.com/spektom/spark-workshop/blob/master/spark-streams/src/main/scala/com/github/spektom/spark/streams/DirectStreamJob.scala
Aggregation in Streams
Aggregation - Windowing
[Diagram: a DStream is produced at the batch interval; a window spans several batches (window length) and advances by the sliding interval]
Aggregation - Structured Streaming
● Window operations based on event time
● Handling late data properly
● Different output sink modes
○ Complete
○ Append
○ Update
Aggregation - Watermarking
● Allows late events to be counted in the correct bucket
● Events that arrive too late are dropped
● Cleans in-memory state of irrelevant (old) events
Aggregation - Watermarking
val words = ... // { timestamp: Timestamp, word: String }
// Group the data by window and word and compute the count of each group
val windowedCounts = words
.withWatermark("timestamp", "10 minutes") // late events threshold
.groupBy(
window($"timestamp", "10 minutes", "5 minutes"),
$"word")
.count()
Stateful Streaming - mapWithState
● Since Spark 1.6
● State is saved in memory
● Checkpoints to disk after (batchInterval * constant)
val updateState = (batchTime: Time, key: String, value: Option[Int], state: State[Long]) => {
val sum = ...
Some((key, sum))
}
val spec = StateSpec.function(updateState)
.initialState(rdd).numPartitions(10).timeout(Seconds(60))
val statefulStream = stream.mapWithState(spec)
Stateful Streaming - mapGroupsWithState
● Since Spark 2.2
● Able to keep state between application upgrades
● State versioning allows keeping only the latest version in memory
● Incremental checkpoints
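A sketch of the API, assuming events is a streaming Dataset[Event] and spark.implicits._ is in scope (the Event case class and field names are illustrative):
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
case class Event(user: String, clicks: Long)
val totals = events
.groupByKey(_.user)
.mapGroupsWithState(GroupStateTimeout.NoTimeout) {
(user: String, batch: Iterator[Event], state: GroupState[Long]) =>
val total = state.getOption.getOrElse(0L) + batch.map(_.clicks).sum
state.update(total)
(user, total)
}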
Output Operations
Output Operations - Wrong
dstream.foreachRDD { rdd =>
val connection = createNewConnection() // executed at the driver
rdd.foreach { record =>
connection.send(record) // executed at the worker
}
}
Output Operations - Nonoptimal
dstream.foreachRDD { rdd =>
rdd.foreach { record =>
val connection = createNewConnection() // executed at the worker
connection.send(record)
connection.close()
}
}
Output Operations - Better
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
val connection = createNewConnection()
partitionOfRecords.foreach(record => connection.send(record))
connection.close()
}
}
Output Operations - Best
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
// ConnectionPool is a static, lazily initialized pool of connections
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
// return to the pool for future reuse
ConnectionPool.returnConnection(connection)
}
}
Streaming Application Goals
● Keep the processing time below the batch interval
● Handle failures properly
Resilience
Resilience - All Components
All Spark components must be resilient!
● Driver application process
● Master process
● Worker process
● Executor process
● Receiver thread
● Worker node
[Diagram: the Driver and Master coordinating Worker Nodes, each running an Executor with its Tasks]
Resilience - Driver
● Client mode
○ Driver application runs inside the "spark-submit" process
○ If this process dies, the entire application is killed
● Cluster mode
○ Driver application runs on one of the worker nodes
○ The "--supervise" option makes the driver restart on a different worker node
● Automatic restart through an app launcher
○ Marathon
○ Consul
Resilience - Master
● Single master
○ If the master fails, the entire application is killed
● Multi-master mode
○ A standby master is elected active
○ Worker nodes automatically register with the new master
Resilience - Worker
● Worker process
○ When it fails, all child processes (driver or executor) are killed
○ A new worker process is launched automatically
● Executor process
○ Restarted on failure by the parent worker process
● Receiver thread
○ Runs inside the Executor process, so it behaves the same as the Executor
● Worker node
○ Failure of a worker node behaves the same as failure of all the processes running on it
Profiling
Profiling - Spark UI
● Data skews
● Misconfiguration
● Memory issues
● Nonoptimal computations
● Streaming health
Profiling - Flame Graph
Easy way: https://github.com/spektom/spark-flamegraph
Profiling - GC Log
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
Profiling - "Explain" Queries
dataframe.explain (rdd.toDebugString)
Monitoring
Monitoring
[Diagram: metrics from the Cluster Manager, Workflow Manager, Spark Application and OS are pushed to sinks such as StatsD and Graphite, and feed automation]
Monitoring - Cluster Manager
● Report on underutilized resources (cost)
● Implement dynamic scaling (failover, cost)
Monitoring - Workflow Manager
● Jobs SLA
● Execution times
● Failures
Monitoring - Spark Application
● Spark metrics
○ Via StatsD (via an extension, or natively starting with Spark 2.3.0)
● Application health
○ Spark listeners are helpful
● Application business metrics
Optimizing Spark
Optimizing
● Start from here: http://spark.apache.org/docs/latest/tuning.html
● YARN-based clusters
○ Cloudera blog / documentation
Optimizing - Partitions Number
● Number of files/blocks is important
● Size of file/block is important
● Control the number of shuffle partitions (see the sketch below)
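A sketch of these knobs, assuming df is an existing Dataframe (values and output path are illustrative):
// Shuffle parallelism for Dataframe/SQL operations (default: 200)
spark.conf.set("spark.sql.shuffle.partitions", "400")
// Increase parallelism of an under-partitioned Dataframe (full shuffle)
val repartitioned = df.repartition(400)
// Reduce the number of output files without a full shuffle
repartitioned.coalesce(32).write.parquet("s3://mycomp-reports/output/")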
Optimizing - GC
● Try different GC algorithms
● Set the Young Generation size
● Consider where most of your objects live
Optimizing - Amazon AWS
● Choose appropriate EC2 Instance type
● Prefer attached SSD
● Verify enhanced networking is enabled
Optimizing - OS
● Disable transparent hugepages
● Disable host swappiness
● Increase max number of open files
● Tune SSD mount configuration
Optimizing - Storage
● Reduce data structure size
○ Enforce Kryo serialization
● Choose appropriate data storage format
○ Parquet
○ Avro
● Use compression
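A sketch of enforcing Kryo via the Spark configuration (the registered class is illustrative):
import org.apache.spark.SparkConf
val conf = new SparkConf()
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// registering classes avoids writing full class names into the serialized data
.registerKryoClasses(Array(classOf[Person]))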
Optimizing - Data
● Avoid data skews
● If you can’t avoid them:
○ Use salting
○ Progressive sharding
○ Bin-packing with custom partitioner
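A minimal salting sketch, assuming pairs is a skewed RDD[(String, Int)] (the salt factor is illustrative):
val saltFactor = 10
// 1. Salt: append a random suffix so a hot key spreads over many partitions
val salted = pairs.map { case (k, v) => ((k, scala.util.Random.nextInt(saltFactor)), v) }
// 2. Aggregate on the salted key
val partial = salted.reduceByKey(_ + _)
// 3. Unsalt and finish the aggregation on the original key
val result = partial.map { case ((k, _), v) => (k, v) }.reduceByKey(_ + _)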
Optimizing - Code
● Avoid JOINs
● Minimize number of aggregations
● In general, avoid shuffles
● map vs mapPartitions
● groupByKey vs reduceByKey
● etc.
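For example, the groupByKey vs reduceByKey point, assuming pairs is an RDD[(String, Int)]:
// groupByKey ships every value across the network before summing
val sums1 = pairs.groupByKey().mapValues(_.sum)
// reduceByKey pre-aggregates within each partition, shuffling far less data
val sums2 = pairs.reduceByKey(_ + _)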
Optimizing - Broadcast JOIN
● Prerequisite: one of the joined datasets is small
● spark.sql.autoBroadcastJoinThreshold (using Hive metadata)
● broadcast(dataset) as a hint
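A sketch of the hint, with illustrative Dataframe and column names:
import org.apache.spark.sql.functions.broadcast
// each executor gets a copy of the small side, avoiding a shuffle of the large side
val joined = events.join(broadcast(countries), Seq("country_code"))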
Optimizing - Tips
● Learn to use Spark UI
● Look for optimizations for your specific case in Google
● Use explain() / toDebugString()
Resources
● https://spark.apache.org/documentation.html
● https://jaceklaskowski.gitbooks.io/mastering-apache-spark
Thank You!
Editor's Notes

  1. The purpose of the talk is not to teach theoretical material, but rather to have an interactive conversation about how to improve existing pipelines. There might be some topics missing; I tried to collect everything I consider important. If something that interests you is missing, please tell me. Maybe the lecture is not organized well; this is the first time I am presenting it (I would love to get your feedback).
  2. Spark is distributed computing and data processing engine R is a programming language for statistical computing and graphics
  3. We’ll concentrate on Scala from now on
  4. Use class relocation for resolving conflicts
  5. RDD abstraction enables Spark programming model
  6. Number of partitions is determined by: The source (in HDFS: number of partitions = number of blocks) Default parallelism setting Stages are separated by shuffles If something can start without waiting for previous results - it can run in the same stage
  7. Transform operations run on the same Task
  8. It’s a higher abstraction Examples of hierarchical data: JSON, class objects, Protobuf messages, etc. Constructing from RDD: Inferring schema using reflection Providing schema
  9. A couple of useful concepts that are crucial for optimizing Spark workloads
  10. Sometimes, saving intermediate results to S3 and loading them later is faster than using Spark caching mechanism.
  11. Checkpointing is extremely helpful in streaming applications, when previous dataset is unioned with the current one. Without it, RDD graph will grow endlessly. localCheckpoint trades fault tolerance for performance
  12. This is a regular value, not rdd or dataframe
  13. Spark driver is a coordinator of the application. Driver and executors are separate JVM processes. A Task handles one partition. There is an option to tell how many CPU cores to use per task. Having multiple tasks running on the same executor can help share memory between them. Operation: A standalone application starts and instantiates a SparkContext instance (and it is only then that you can call the application a driver). The driver program asks the cluster manager for resources to launch executors. The cluster manager launches executors. The driver process runs through the user application. Depending on the actions and transformations over RDDs, tasks are sent to executors. Executors run the tasks and save the results. If any worker crashes, its tasks will be sent to different executors to be processed again. With SparkContext.stop() from the driver, or if the main method exits/crashes, all the executors will be terminated and the cluster resources will be released by the cluster manager.
  14. Resolving logical plan - resolving references like: views, etc. Logical optimization: joining two datasets, one has date range, the other hasn’t, and we’re joining on the date. Or, removing unused columns. Logical plan describes dataset computation without defining how to conduct it
  15. There are two levels of configuration: YARN and Spark The configuration is complex. Cloudera packages make it simpler.
  16. Automatic configuration is applied based on selected instance types Dynamic configuration is enabled automatically
  17. Good for single applications running on their own cluster
  18. Good for multitenant clusters
  19. Not really a streaming engine. Advantages: reuse of existing infra, rich API, straightforward windowing, easier to implement "exactly once". Disadvantages: latency.
  20. Receiver is run as a Task. Block is defined by block interval. Block manager is responsible for storing and replicating blocks. Blocks can be written to WAL to provide fault tolerance.
  21. RDD is calculated as part of every micro-batch job. In case of Kafka: the Driver computes offsets to fetch from; every RDD is calculated based on start/end offsets; the RDD is a KafkaRDD, which is responsible for retrieving data per partition.
  22. First appeared in Spark 2.0 Complete - dumps everything to the output sink Append - only writes new rows to the output sink Update - writes all new and updated rows to the output sink
  23. First appeared in Spark 2.1. This model was first seen in Google Dataflow, which was later contributed to Apache Beam.
  24. Late events threshold - 10 minutes Window size - 10 minutes Slide interval - 5 minutes
  25. Instead of using Marathon, multiple instances of scripts that acquire a lock in Consul can be used.
  26. Flame graph is a modern way of call time profiling
  27. Spark UI can help pinpoint things like: data skews, misconfiguration, wrongly written code, etc. Flame graph is a modern way of call time profiling GC log must be enabled if you plan to tune GC (and you should) Explain query shows physical plan
  29. Spark metrics contain quite everything you’ll find on Spark UI
  30. There’s no rule of thumb, tuning is always according to the workload
  31. Spark has great documentation on its website.
  32. Partitions number affect how Spark cluster is utilized. There should be 2-3 tasks running per core. Control number of output files (there’s new coalesce that takes desired files size as input) spark.sql.shuffle.partitions helps control shuffle parallelism (default: 200). This also controls shuffle block size (whose limit is 2G). Rule of thumb - 128Mb per partition If number of partitions is close to 2000 - bump it to 2001 (Spark uses different data structure for bookkeeping during shuffling)
  33. GC must be tuned according to a workload We used CMS, G1 Some workloads require bigger young generation
  34. Stronger EC2 instances can solve disk/network issues. Verify that enhanced networking is enabled in kernel (if it’s supported by instance type). The slide is taken from Amazon AWS slides
  35. Hugepages (HP) created for solving small memory block size issue. THP was aimed to ease the process of configuring HP, but the feature doesn’t work properly - memory becomes fragmented. Swappiness define swapping frequency. 0 - disable completely, 1 - emergency swappiness (prevents from killing processes)
  36. Kryo serialization is faster and more compact, can speed up by 10%-20% depending on a workload Columnar formats like Parquet allow reading only required columns. Avro might be better if you always read whole columns. Different compression algorithms give different compression rates at price of performance. Bottom line: gzip compression is quite good.
  37. Salting: adding random element to partition keys. Howto: Salt keys -> Do the operation on salted keys -> Unsalt keys. Progressive sharding: iterative process of detecting shard size, then avoiding skewed shards in each iteration
  38. Use windowing functions to minimize the number of aggregations (and to avoid joins). groupByKey first groups and then reduces; reduceByKey reduces within each partition first, thus minimizing the shuffled data set.
  39. There’s plenty of information on how to optimize Spark for specific use case For example, speeding up data access on S3, etc.