4. Spark Architecture
Driver
Hosts the SparkContext
Cockpit of job and task execution
Schedules tasks to run on executors
Contains the DAGScheduler and TaskScheduler
Executor
Static allocation vs. dynamic allocation
Sends heartbeats and metrics to the driver
Provides in-memory storage for RDDs
Communicates directly with the driver to execute tasks
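As a sketch of the two executor allocation modes, the following configuration fragment (standard Spark properties; the counts are illustrative) contrasts them:

```properties
# Static allocation: a fixed number of executors for the whole application.
spark.executor.instances             4

# Dynamic allocation: Spark grows and shrinks the executor pool with load.
# (Typically paired with the external shuffle service so shuffle files
# survive executor removal.)
spark.dynamicAllocation.enabled      true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 10
spark.shuffle.service.enabled        true
```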
8. A sample Spark program
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)
val file = sc.textFile("hdfs://...")          // This is an RDD
val errors = file.filter(_.contains("ERROR")) // This is an RDD (lazy transformation)
val errorCount = errors.count()               // This is an "action": it triggers a job
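The last two lines show Spark's split between lazy transformations and eager actions. As a rough analogy in plain Scala (no cluster needed; the log lines are made up), a lazy collection view behaves similarly: nothing runs until a terminal step forces it:

```scala
// Plain-Scala analogy for RDD laziness (not Spark itself).
val lines = List("INFO start", "ERROR disk full", "ERROR timeout", "INFO done")

// Like file.filter(...): declares the computation but runs nothing yet.
val errors = lines.view.filter(_.contains("ERROR"))

// Like errors.count(): a terminal step that forces evaluation.
val errorCount = errors.size // 2
```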
9. Job-Stage-Task
What is a Job?
Top-level work item, submitted once per action
A job computes the partitions of a target RDD
Covers the target RDD's lineage
10. Job-Stage-Task
A job is divided into stages
Logical plan → physical plan (the stage is the execution unit)
A stage is a set of parallel tasks
Stage boundaries fall at shuffles
Computing a stage triggers execution of its parent stages first
11. Types of Stages
ShuffleMapStage:
Intermediate stage in the execution DAG
Saves map output to be fetched later across the shuffle
Runs the pipelined operations before a shuffle
Can be shared across jobs
ResultStage:
Final stage; executes the action
Works on one or many partitions of the target RDD
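As a hedged sketch of the two stage types (plain Scala modelling the mechanics, not calling Spark), a word count splits at the shuffle: map-side work runs pipelined per partition (the ShuffleMapStage role), its output is regrouped by key at the shuffle boundary, and the final reduction (the ResultStage role) produces the answer:

```scala
// Two input partitions of text lines (made-up data).
val partitions = Seq(Seq("a b", "b c"), Seq("c a"))

// "ShuffleMapStage": pipelined flatMap + map, independently per partition.
val mapOutput: Seq[Seq[(String, Int)]] =
  partitions.map(_.flatMap(_.split(" ")).map(word => (word, 1)))

// Shuffle boundary: map output is written out, then fetched and
// regrouped by key across partitions.
val shuffled: Map[String, Seq[(String, Int)]] = mapOutput.flatten.groupBy(_._1)

// "ResultStage": reduce each key's counts to get the final result.
val counts: Map[String, Int] =
  shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
// counts == Map("a" -> 2, "b" -> 2, "c" -> 2)
```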
12. Job-Stage-Task
Smallest unit of execution
Comprises a function and a placement preference
A task operates on a single partition
Launched on an executor and run there
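A minimal sketch of that definition in plain Scala (the Task class below is a hypothetical model for illustration, not Spark's internal one): a task bundles the function to run with the partition it targets and its placement preference:

```scala
// Hypothetical model of a task: function + partition id + locality preference.
case class Task[T, U](
  partitionId: Int,
  compute: Seq[T] => U,       // the work to perform on one partition
  preferredHosts: Seq[String] // where the partition's data lives
)

// One partition of made-up log lines, and a task that counts errors in it.
val partition = Seq("ERROR disk", "INFO ok", "ERROR net")
val task = Task[String, Int](0, rows => rows.count(_.contains("ERROR")), Seq("host-1"))

val result = task.compute(partition) // one task, one partition -> 2
```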
18. DAGScheduler Responsibilities
● Computes an execution DAG
● Determines the preferred locations to run each task on
● Handles failures due to lost shuffle output files (FetchFailed, ExecutorLost)
● Stage retry
21. TaskSet and TaskSetManager
● What is a TaskSet?
○ A fully independent sequence of tasks from a single stage
● Why a TaskSetManager?
● Responsibilities of the TaskSetManager:
○ Scheduling the tasks in a TaskSet
○ Completion notification
○ Retry and abort
○ Locality preference
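The locality preference is time-boxed. A sketch of the relevant settings (standard Spark properties, shown with their usual defaults) that control how long the scheduler waits for a better locality level before falling back to a worse one:

```properties
# How long to wait at a given locality level before degrading to the next
# (process-local -> node-local -> rack-local -> any).
spark.locality.wait          3s
spark.locality.wait.process  3s
spark.locality.wait.node     3s
spark.locality.wait.rack     3s
```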
22. TaskScheduler’s Responsibilities
● Responsible for submitting tasks for execution for every stage
● Works closely with the DAGScheduler to resubmit failed stages
● Tracks the executors in a Spark application (executorHeartBeat and executorLost)