2. About Me
● Software Engineer since 2002
● Devops -> Dev Tooling -> Backend -> Big Data
● Experience with Spark
○ Tens of Spark daily jobs
○ Processing tens of terabytes daily
○ 150K events/sec using Spark streaming with Kafka
● Open Source
○ https://github.com/viyadb
5. Spark Application
val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
● Supported languages (by popularity)
○ Scala
○ Python
○ Java
○ R
6. Running Applications
● Distribution
○ An uberjar that includes all the dependencies not already provided by Spark
● Submitting
○ spark-submit --conf spark.key=value … --class … application.jar
8. RDD - Operations
[Diagram: data flows from the external world through partitioned RDDs and back out: read → transform → shuffle → action; a shuffle may change the number of partitions (K → N)]
9. RDD - Operations
● Read
○ Database, Message Queue, File Storage, Socket, etc.
● Transform
○ map, flatMap, mapPartitions, filter, etc.
● Shuffle
○ reduceByKey, groupByKey, etc.
● Action
○ show, collect, count, etc.
10. RDD - Operations
● Lazy: executed only when an action is called
● Re-executed every time an action is called (unless the result is cached)
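A minimal sketch (hypothetical file name) illustrating both points:
// nothing is read or computed here - only the lineage is recorded
val upper = sc.textFile("data.txt").map(_.toUpperCase)
upper.count() // action: the file is read and mapped now
upper.count() // the whole chain is executed AGAIN (unless cached)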
11. RDD - Creating
// From Scala objects
val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)
// From files
val rdd = sc.textFile("data.txt")
12. RDD - Custom
class MyRDD(...) extends RDD[T](...) {
  // method for calculating the data for a given partition
  override def compute(s: Partition, c: TaskContext): Iterator[T] = {
    ...
  }
  // method for calculating the partitions
  override protected def getPartitions: Array[Partition] = {
    ...
  }
}
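For illustration, a hedged sketch of a complete custom RDD (RangeRDD and RangePartition are hypothetical names) that generates a range of integers:
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class RangePartition(override val index: Int, val from: Int, val to: Int) extends Partition

class RangeRDD(sc: SparkContext, start: Int, end: Int, numSlices: Int)
    extends RDD[Int](sc, Nil) {
  // stream out the integers belonging to the given partition
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.from until p.to).iterator
  }
  // split [start, end) into numSlices contiguous ranges
  override protected def getPartitions: Array[Partition] = {
    val step = math.max(1, math.ceil((end - start).toDouble / numSlices).toInt)
    Array.tabulate[Partition](numSlices) { i =>
      new RangePartition(i, math.min(start + i * step, end), math.min(start + (i + 1) * step, end))
    }
  }
}
With this, new RangeRDD(sc, 0, 100, 4).collect() yields the numbers 0 to 99, computed across 4 partitions.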
14. Creating Dataset - Using RDD
// from an RDD of objects, using reflection
val rdd: RDD[Person] = …
val df = rdd.toDF()
// from an RDD of Rows, providing a schema
val rowRdd = sc.textFile("people.txt").map(line => Row(line.split(","): _*))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true), ...))
val df = spark.createDataFrame(rowRdd, schema)
15. Creating Dataset - Using Source
// from JSON file
val df = spark.read.json("people.json")
// from Parquet files
val df = spark.read.parquet("people/*.parquet")
17. Querying Dataframe - SQL
val df = spark.read.json("people.json")
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("""
  SELECT age, count(*) FROM people
  WHERE name = 'John' AND age > 21
  GROUP BY age
""")
19. Caching
val df = transformData(loadData())
val report1 = calculateReport1(df)
report1.write.parquet(s"s3://mycomp-reports/report1/dt=${dt}")
val report2 = calculateReport2(df)
report2.write.parquet(s"s3://mycomp-reports/report2/dt=${dt}")
// Problem: data will be loaded and transformed twice!
20. Caching
val df = transformData(loadData())
// cache calculated data using default storage level
df.cache()
val report1 = calculateReport1(df)
report1.write.parquet(s"s3://mycomp-reports/report1/dt=${dt}")
val report2 = calculateReport2(df)
report2.write.parquet(s"s3://mycomp-reports/report2/dt=${dt}")
21. Caching
val df = transformData(loadData())
// cache calculated data using specific storage level
df.persist(StorageLevel.MEMORY_AND_DISK)
val report1 = calculateReport1(df)
report1.write.parquet(s"s3://mycomp-reports/report1/dt=${dt}")
val report2 = calculateReport2(df)
report2.write.parquet(s"s3://mycomp-reports/report2/dt=${dt}")
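Once both reports are written, the cached blocks can be released explicitly (a small follow-up to the examples above):
// free the executor memory/disk held by the cached data
df.unpersist()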
26. Job Scheduling
[Diagram: RDD Objects → DAG Scheduler → Task Scheduler → Executor]
● RDD Objects
○ Build the DAG of operators
● DAG Scheduler (DAG → TaskSet)
○ Splits the DAG into stages of tasks
○ Submits each stage as ready
○ Agnostic to operators
● Task Scheduler (TaskSet → Tasks, via the cluster scheduler)
○ Launches tasks
○ Retries failed tasks
○ Agnostic to stages
● Executor
○ Executes tasks
○ Stores and serves blocks (Block Manager)
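One way to observe the DAG-to-stages split is toDebugString on the word-count job from earlier (hypothetical input path); the indentation levels in its output mark the shuffle-separated stages:
val counts = sc.textFile("data.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _) // shuffle: the stage boundary lands here
println(counts.toDebugString)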
35. Receiver Based Approach
[Diagram: a Receiver runs inside an Executor and receives data into Data Blocks; blocks are replicated to a second Executor; the Spark Driver is notified of received blocks and, every batch interval, launches tasks to process them]
36. Receiver Based Approach Sample
● https://github.com/spektom/spark-workshop/blob/master/spark-streams/src/main/scala/com/github/spektom/spark/streams/ReceiverStreamJob.scala
41. Aggregation - Structured Streaming
● Window operations based on event time
● Handling late data properly
● Different output sink modes
○ Complete
○ Append
○ Update
42. Aggregation - Watermarking
● Allows late events to be counted in the correct bucket
● Events that arrive too late are dropped
● Cleans irrelevant (old) events from the in-memory state
44. Aggregation - Watermarking
val words = ... // { timestamp: Timestamp, word: String }
// Group the data by window and word and compute the count of each group
val windowedCounts = words
  .withWatermark("timestamp", "10 minutes") // late events threshold
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()
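As a follow-up sketch, one possible way to start the query above; the console sink and the "update" output mode are illustrative choices ("update" emits only the windows whose counts changed in the current trigger):
val query = windowedCounts.writeStream
  .outputMode("update") // or "complete" / "append"
  .format("console")
  .start()
query.awaitTermination()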
45. Stateful Streaming - mapWithState
● Since Spark 1.6
● State is saved in memory
● Checkpoints to disk after (batchInterval * constant)
val updateState = (batchTime: Time, key: String, value: Option[Int],
                   state: State[Long]) => {
  val sum = ...
  Some((key, sum))
}
val spec = StateSpec.function(updateState)
  .initialState(rdd).numPartitions(10).timeout(Seconds(60))
val statefulStream = stream.mapWithState(spec)
46. Stateful Streaming - mapGroupsWithState
● Since Spark 2.2
● Can keep the state between application upgrades
● State versioning keeps only the latest version in memory
● Incremental checkpoints
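A minimal sketch of mapGroupsWithState (Event, WordCount and the events stream are hypothetical) that keeps a running count per word across micro-batches:
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

case class Event(word: String)
case class WordCount(word: String, count: Long)

val counts = events // a streaming Dataset[Event]
  .groupByKey(_.word)
  .mapGroupsWithState[Long, WordCount](GroupStateTimeout.NoTimeout) {
    (word, batch, state) =>
      val newCount = state.getOption.getOrElse(0L) + batch.size
      state.update(newCount)
      WordCount(word, newCount)
  }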
48. Output Operations - Wrong
dstream.foreachRDD { rdd =>
val connection = createNewConnection() // executed at the driver
rdd.foreach { record =>
connection.send(record) // executed at the worker
}
}
49. Output Operations - Suboptimal
dstream.foreachRDD { rdd =>
rdd.foreach { record =>
val connection = createNewConnection() // executed at the worker
connection.send(record)
connection.close()
}
}
51. Output Operations - Best
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
// ConnectionPool is a static, lazily initialized pool of connections
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
// return to the pool for future reuse
ConnectionPool.returnConnection(connection)
}
}
54. Resilience - All Components
All Spark components must be resilient!
● Driver application process
● Master process
● Worker process
● Executor process
● Receiver thread
● Worker node
[Diagram: the Driver talks to the Master; Worker Nodes each run an Executor with Tasks]
55. Resilience - Driver
● Client mode
○ The driver application runs inside the "spark-submit" process
○ If this process dies, the entire application is killed
● Cluster mode
○ The driver application runs on one of the worker nodes
○ The "--supervise" option makes the driver restart on a different worker node
● Automatic restart through an app launcher
○ Marathon
○ Consul
56. Resilience - Master
● Single master
○ If the master fails, the entire application is killed
● Multi-master mode
○ A standby master is elected active
○ Worker nodes automatically register with the new master
57. Resilience - Worker
● Worker process
○ When it fails, all child processes (driver or executor) are killed
○ A new worker process is launched automatically
● Executor process
○ Restarted on failure by the parent worker process
● Receiver thread
○ Runs inside the executor process - same behavior as the executor
● Worker node
○ Failure of a worker node behaves the same as a failure of all the processes running on it
67. Monitoring - Spark Application
● Spark metrics
○ Via StatsD (via an extension, or natively since 2.3.0)
● Application health
○ Spark listeners are helpful (see the sketch after this list)
● Application business metrics
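A minimal sketch of a custom Spark listener for application-health metrics (the counter and the reporting hook are hypothetical):
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class TaskFailureListener extends SparkListener {
  val failures = new AtomicLong(0)
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    if (taskEnd.reason != Success) {
      failures.incrementAndGet() // report to your metrics backend here
    }
  }
}

spark.sparkContext.addSparkListener(new TaskFailureListener)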
73. Optimizing - OS
● Disable transparent hugepages
● Disable host swappiness
● Increase max number of open files
● Tune SSD mount configuration
74. Optimizing - Storage
● Reduce the size of data structures
○ Enforce Kryo serialization (see the sketch after this list)
● Choose appropriate data storage format
○ Parquet
○ Avro
● Use compression
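A sketch of enforcing Kryo serialization (MyEvent is a hypothetical application class); registrationRequired makes unregistered classes fail fast instead of silently falling back to writing full class names:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[MyEvent]))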
75. Optimizing - Data
● Avoid data skews
● If you can't avoid them:
○ Use salting (see the sketch after this list)
○ Progressive sharding
○ Bin-packing with a custom partitioner
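A minimal salting sketch (hypothetical key/value column names): spread hot keys over random salt buckets, aggregate per salted key, then merge the partial results:
import org.apache.spark.sql.functions.{col, rand, sum}

val numSalts = 16
val partials = df
  .withColumn("salt", (rand() * numSalts).cast("int"))
  .groupBy(col("key"), col("salt"))
  .agg(sum(col("value")).as("partial")) // the heavy shuffle is spread across salts
val result = partials
  .groupBy(col("key"))
  .agg(sum(col("partial")).as("total")) // cheap second aggregation "unsalts" the keys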
76. Optimizing - Code
● Avoid JOINs
● Minimize the number of aggregations
● In general, avoid shuffles
● map vs mapPartitions
● groupByKey vs reduceByKey (see the sketch after this list)
● etc.
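For instance, for word counts (a sketch; words is assumed to be an RDD[String]):
val pairs = words.map((_, 1))
val fast = pairs.reduceByKey(_ + _) // combines map-side before the shuffle
val slow = pairs.groupByKey().mapValues(_.sum) // ships every value across the network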
77. Optimizing - Broadcast JOIN
● Prerequisite: one of the joined datasets is small
● spark.sql.autoBroadcastJoinThreshold (using Hive metadata)
● broadcast(dataset) as a hint
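A sketch (large, small and the join column are hypothetical) of hinting the join explicitly:
import org.apache.spark.sql.functions.broadcast

// `small` is shipped to every executor, so `large` is not shuffled
val joined = large.join(broadcast(small), Seq("id"))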
78. Optimizing - Tips
● Learn to use Spark UI
● Look for optimizations for your specific case in Google
● Use explain() / toDebugString()
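For example (df and rdd stand for any Dataframe/RDD under inspection):
df.explain() // physical plan only
df.explain(extended = true) // parsed, analyzed and optimized logical plans as well
println(rdd.toDebugString) // RDD lineage, indented at stage boundaries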
The purpose of the talk is not to teach theoretical material, but rather to have an interactive conversation about how to improve existing pipelines.
Some topics might be missing; I tried to collect everything that is important in my opinion. If something that interests you is missing, please tell me.
The lecture may not be well organized; this is the first time I'm presenting it (I'd love to get your feedback).
Spark is a distributed computing and data processing engine.
R is a programming language for statistical computing and graphics
We’ll concentrate on Scala from now on
Use class relocation (shading) to resolve dependency conflicts.
RDD abstraction enables Spark programming model
Number of partitions is determined by:
The source (in HDFS: number of partitions = number of blocks)
Default parallelism setting
Stages are separated by shuffles
If something can start without waiting for previous results, it can run in the same stage.
Transform operations run on the same Task
It’s a higher abstraction
Examples of hierarchical data: JSON, class objects, Protobuf messages, etc.
Constructing from RDD:
Inferring schema using reflection
Providing schema
A couple of useful concepts that are crucial for optimizing Spark workloads
Sometimes, saving intermediate results to S3 and loading them later is faster than using Spark's caching mechanism.
Checkpointing is extremely helpful in streaming applications, when the previous dataset is unioned with the current one. Without it, the RDD graph will grow endlessly.
localCheckpoint trades fault tolerance for performance
This is a regular value, not rdd or dataframe
The Spark driver is the coordinator of the application.
The driver and executors are separate JVM processes.
A task handles one partition.
There's an option to specify how many CPU cores to use per task.
Having multiple tasks running in the same executor can help them share memory.
A standalone application starts and instantiates a SparkContext instance (only then can you call the application a driver).
The driver program asks the cluster manager for resources to launch executors.
The cluster manager launches executors.
The driver process runs through the user application; depending on the actions and transformations over RDDs, tasks are sent to executors.
Executors run the tasks and save the results.
If any worker crashes, its tasks will be sent to different executors to be processed again.
With SparkContext.stop() from the driver, or if the main method exits/crashes, all the executors will be terminated and the cluster resources will be released by the cluster manager.
Resolving the logical plan means resolving references such as views, etc.
Logical optimization example: when joining two datasets on a date and only one of them has a date-range filter, the filter can be applied to the other as well; another example is removing unused columns.
The logical plan describes the dataset computation without defining how to carry it out.
There are two levels of configuration: YARN and Spark
The configuration is complex. Cloudera packages make it simpler.
Automatic configuration is applied based on selected instance types
Dynamic configuration is enabled automatically
Good for single applications running on their own cluster
Good for multitenant clusters
Not really a streaming engine.
Advantages: reuses existing infra, rich API, straightforward windowing, easier to implement "exactly once".
Disadvantages: latency.
The receiver runs as a task.
A block is defined by the block interval.
The block manager is responsible for storing and replicating blocks.
Blocks can be written to a WAL to provide fault tolerance.
RDD is calculated as part of every micro-batch job
In the case of Kafka:
The driver computes the offsets to fetch from.
Every RDD is calculated based on start/end offsets.
The RDD is a KafkaRDD, which is responsible for retrieving data per partition.
First appeared in Spark 2.0
Complete - dumps everything to the output sink
Append - only writes new rows to the output sink
Update - writes all new and updated rows to the output sink
First appeared in Spark 2.1
This model was first seen in Google Dataflow, which was later contributed to Apache Beam.
Instead of using Marathon, multiple instances of scripts that acquire a lock in Consul can be used.
A flame graph is a modern way of call-time profiling.
Spark UI can help pinpoint things like data skews, misconfiguration, wrongly written code, etc.
The GC log must be enabled if you plan to tune GC (and you should).
An explain query shows the physical plan.
Spark metrics contain pretty much everything you'll find in the Spark UI.
There's no universal rule of thumb; tuning always depends on the workload.
Spark has a great documentation website.
The number of partitions affects how the Spark cluster is utilized. There should be 2-3 tasks running per core.
Control the number of output files (there's a new coalesce that takes the desired file size as input).
spark.sql.shuffle.partitions helps control shuffle parallelism (default: 200). This also controls shuffle block size (whose limit is 2G).
Rule of thumb: 128MB per partition.
If the number of partitions is close to 2000, bump it to 2001 (Spark uses a different data structure for bookkeeping during shuffles).
GC must be tuned according to the workload.
We used CMS and G1.
Some workloads require a bigger young generation.
Stronger EC2 instances can solve disk/network issues.
Verify that enhanced networking is enabled in the kernel (if it's supported by the instance type).
The slide is taken from Amazon AWS slides
Hugepages (HP) were created to solve the small memory block size issue. THP aimed to ease the process of configuring HP, but the feature doesn't work properly: memory becomes fragmented.
Swappiness defines the swapping frequency: 0 disables it completely, 1 means emergency-only swapping (prevents processes from being killed).
Kryo serialization is faster and more compact, and can speed things up by 10%-20% depending on the workload.
Columnar formats like Parquet allow reading only the required columns. Avro might be better if you always read entire rows (all columns).
Different compression algorithms give different compression rates at the price of performance. Bottom line: gzip compression is quite good.
Salting: adding a random element to partition keys. How-to: salt keys -> do the operation on salted keys -> unsalt keys.
Progressive sharding: an iterative process of detecting shard sizes, then avoiding skewed shards in each iteration.
Use windowing functions to minimize the number of aggregations (and to avoid joins).
groupByKey first groups (shuffles all values) and only then reduces; reduceByKey works the other way around, reducing map-side first, thus minimizing the shuffled data set.
There's plenty of information on how to optimize Spark for a specific use case.
For example, speeding up data access on S3, etc.