2. About Me
● Software Engineer since 2002
● Devops -> Dev Tooling -> Backend -> Big Data
● Experience with Spark
○ Tens of Spark daily jobs
○ Processing tens of terabytes daily
○ 150K events/sec using Spark streaming with Kafka
● Open Source
○ https://github.com/viyadb
5. Spark Application
val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
● Supported languages (by popularity)
○ Scala
○ Python
○ Java
○ R
6. Running Applications
● Distribution
○ An uberjar that includes all the dependencies not already provided by Spark
● Submitting
○ spark-submit --conf spark.key=value … --class … application.jar
8. RDD - Operations
[Diagram: data flows from the external world through partitioned RDDs and back out: read → transform → shuffle → action; a shuffle may change the number of partitions (K → N)]
9. RDD - Operations
● Read
○ Database, Message Queue, File Storage, Socket, etc.
● Transform
○ map, flatMap, mapPartitions, filter, etc.
● Shuffle
○ reduceByKey, groupByKey, etc.
● Action
○ show, collect, count, etc.
10. RDD - Operations
● Lazy: executed only when an action is called
● Re-executed every time an action is called (unless the result is cached)
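A minimal sketch (hypothetical file name) illustrating both points:
// nothing is read or computed here - only the lineage is recorded
val upper = sc.textFile("data.txt").map(_.toUpperCase)
upper.count() // action: the file is read and mapped now
upper.count() // the whole chain is executed AGAIN (unless cached)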
11. RDD - Creating
// From Scala objects
val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)
// From files
val rdd = sc.textFile("data.txt")
12. RDD - Custom
class MyRDD(...) extends RDD[T](...) {
  // method for calculating the data for a given partition
  override def compute(s: Partition, c: TaskContext): Iterator[T] = {
    ...
  }
  // method for calculating the partitions
  override protected def getPartitions: Array[Partition] = {
    ...
  }
}
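For illustration, a hedged sketch of a complete custom RDD (RangeRDD and RangePartition are hypothetical names) that generates a range of integers:
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class RangePartition(override val index: Int, val from: Int, val to: Int) extends Partition

class RangeRDD(sc: SparkContext, start: Int, end: Int, numSlices: Int)
    extends RDD[Int](sc, Nil) {
  // stream out the integers belonging to the given partition
  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val p = split.asInstanceOf[RangePartition]
    (p.from until p.to).iterator
  }
  // split [start, end) into numSlices contiguous ranges
  override protected def getPartitions: Array[Partition] = {
    val step = math.max(1, math.ceil((end - start).toDouble / numSlices).toInt)
    Array.tabulate[Partition](numSlices) { i =>
      new RangePartition(i, math.min(start + i * step, end), math.min(start + (i + 1) * step, end))
    }
  }
}
With this, new RangeRDD(sc, 0, 100, 4).collect() yields the numbers 0 to 99, computed across 4 partitions.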
14. Creating Dataset - Using RDD
// from an RDD of objects, using reflection
val rdd: RDD[Person] = …
val df = rdd.toDF()
// from an RDD of Rows, providing a schema
val rowRdd = sc.textFile("people.txt").map(line => Row(line.split(","): _*))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true), ...))
val df = spark.createDataFrame(rowRdd, schema)
15. Creating Dataset - Using Source
// from JSON file
val df = spark.read.json("people.json")
// from Parquet files
val df = spark.read.parquet("people/*.parquet")
17. Querying Dataframe - SQL
val df = spark.read.json("people.json")
df.createOrReplaceTempView("people")
val sqlDF = spark.sql("""
  SELECT age, count(*) FROM people
  WHERE name = 'John' AND age > 21
  GROUP BY age
""")
19. Caching
val df = transformData(loadData())
val report1 = calculateReport1(df)
report1.write.parquet(s"s3://mycomp-reports/report1/dt=${dt}")
val report2 = calculateReport2(df)
report2.write.parquet(s"s3://mycomp-reports/report2/dt=${dt}")
// Problem: data will be loaded and transformed twice!
20. Caching
val df = transformData(loadData())
// cache calculated data using default storage level
df.cache()
val report1 = calculateReport1(df)
report1.write.parquet(s"s3://mycomp-reports/report1/dt=${dt}")
val report2 = calculateReport2(df)
report2.write.parquet(s"s3://mycomp-reports/report2/dt=${dt}")
21. Caching
val df = transformData(loadData())
// cache calculated data using specific storage level
df.persist(StorageLevel.MEMORY_AND_DISK)
val report1 = calculateReport1(df)
report1.write.parquet(s"s3://mycomp-reports/report1/dt=${dt}")
val report2 = calculateReport2(df)
report2.write.parquet(s"s3://mycomp-reports/report2/dt=${dt}")
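Once both reports are written, the cached blocks can be released explicitly (a small follow-up to the examples above):
// free the executor memory/disk held by the cached data
df.unpersist()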
26. Job Scheduling
[Diagram: RDD Objects → DAG Scheduler → Task Scheduler → Executor]
● RDD Objects
○ Build the DAG of operators
● DAG Scheduler (DAG → TaskSet)
○ Splits the DAG into stages of tasks
○ Submits each stage as ready
○ Agnostic to operators
● Task Scheduler (TaskSet → Tasks, via the cluster scheduler)
○ Launches tasks
○ Retries failed tasks
○ Agnostic to stages
● Executor
○ Executes tasks
○ Stores and serves blocks (Block Manager)
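One way to observe the DAG-to-stages split is toDebugString on the word-count job from earlier (hypothetical input path); the indentation levels in its output mark the shuffle-separated stages:
val counts = sc.textFile("data.txt")
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _) // shuffle: the stage boundary lands here
println(counts.toDebugString)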
35. Receiver Based Approach
[Diagram: a Receiver runs inside an Executor and receives data into Data Blocks; blocks are replicated to a second Executor; the Spark Driver is notified of received blocks and, every batch interval, launches tasks to process them]
36. Receiver Based Approach Sample
● https://github.com/spektom/spark-workshop/blob/master/spark-streams/src/main/scala/com/github/spektom/spark/streams/ReceiverStreamJob.scala
41. Aggregation - Structured Streaming
● Window operations based on event time
● Handling late data properly
● Different output sink modes
○ Complete
○ Append
○ Update
42. Aggregation - Watermarking
● Allows late events to be counted in the correct bucket
● Events that arrive too late are dropped
● Cleans irrelevant (old) events from the in-memory state
44. Aggregation - Watermarking
val words = ... // { timestamp: Timestamp, word: String }
// Group the data by window and word and compute the count of each group
val windowedCounts = words
  .withWatermark("timestamp", "10 minutes") // late events threshold
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()
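As a follow-up sketch, one possible way to start the query above; the console sink and the "update" output mode are illustrative choices ("update" emits only the windows whose counts changed in the current trigger):
val query = windowedCounts.writeStream
  .outputMode("update") // or "complete" / "append"
  .format("console")
  .start()
query.awaitTermination()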
45. Stateful Streaming - mapWithState
● Since Spark 1.6
● State is saved in memory
● Checkpoints to disk after (batchInterval * constant)
val updateState = (batchTime: Time, key: String, value: Option[Int],
                   state: State[Long]) => {
  val sum = ...
  Some((key, sum))
}
val spec = StateSpec.function(updateState)
  .initialState(rdd).numPartitions(10).timeout(Seconds(60))
val statefulStream = stream.mapWithState(spec)
46. Stateful Streaming - mapGroupsWithState
● Since Spark 2.2
● Can keep the state between application upgrades
● State versioning keeps only the latest version in memory
● Incremental checkpoints
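A minimal sketch of mapGroupsWithState (Event, WordCount and the events stream are hypothetical) that keeps a running count per word across micro-batches:
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

case class Event(word: String)
case class WordCount(word: String, count: Long)

val counts = events // a streaming Dataset[Event]
  .groupByKey(_.word)
  .mapGroupsWithState[Long, WordCount](GroupStateTimeout.NoTimeout) {
    (word, batch, state) =>
      val newCount = state.getOption.getOrElse(0L) + batch.size
      state.update(newCount)
      WordCount(word, newCount)
  }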
48. Output Operations - Wrong
dstream.foreachRDD { rdd =>
val connection = createNewConnection() // executed at the driver
rdd.foreach { record =>
connection.send(record) // executed at the worker
}
}
49. Output Operations - Suboptimal
dstream.foreachRDD { rdd =>
rdd.foreach { record =>
val connection = createNewConnection() // executed at the worker
connection.send(record)
connection.close()
}
}
51. Output Operations - Best
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partitionOfRecords =>
// ConnectionPool is a static, lazily initialized pool of connections
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
// return to the pool for future reuse
ConnectionPool.returnConnection(connection)
}
}
54. Resilience - All Components
All Spark components must be resilient!
● Driver application process
● Master process
● Worker process
● Executor process
● Receiver thread
● Worker node
[Diagram: the Driver talks to the Master; Worker Nodes each run an Executor with Tasks]
55. Resilience - Driver
● Client mode
○ The driver application runs inside the "spark-submit" process
○ If this process dies, the entire application is killed
● Cluster mode
○ The driver application runs on one of the worker nodes
○ The "--supervise" option makes the driver restart on a different worker node
● Automatic restart through an app launcher
○ Marathon
○ Consul
56. Resilience - Master
● Single master
○ If the master fails, the entire application is killed
● Multi-master mode
○ A standby master is elected active
○ Worker nodes automatically register with the new master
57. Resilience - Worker
● Worker process
○ When it fails, all child processes (driver or executor) are killed
○ A new worker process is launched automatically
● Executor process
○ Restarted on failure by the parent worker process
● Receiver thread
○ Runs inside the executor process - same behavior as the executor
● Worker node
○ Failure of a worker node behaves the same as a failure of all the processes running on it
67. Monitoring - Spark Application
● Spark metrics
○ Via StatsD (via an extension, or natively since 2.3.0)
● Application health
○ Spark listeners are helpful (see the sketch after this list)
● Application business metrics
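A minimal sketch of a custom Spark listener for application-health metrics (the counter and the reporting hook are hypothetical):
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class TaskFailureListener extends SparkListener {
  val failures = new AtomicLong(0)
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    if (taskEnd.reason != Success) {
      failures.incrementAndGet() // report to your metrics backend here
    }
  }
}

spark.sparkContext.addSparkListener(new TaskFailureListener)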
73. Optimizing - OS
● Disable transparent hugepages
● Disable host swappiness
● Increase max number of open files
● Tune SSD mount configuration
74. Optimizing - Storage
● Reduce the size of data structures
○ Enforce Kryo serialization (see the sketch after this list)
● Choose appropriate data storage format
○ Parquet
○ Avro
● Use compression
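A sketch of enforcing Kryo serialization (MyEvent is a hypothetical application class); registrationRequired makes unregistered classes fail fast instead of silently falling back to writing full class names:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[MyEvent]))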
75. Optimizing - Data
● Avoid data skews
● If you can't avoid them:
○ Use salting (see the sketch after this list)
○ Progressive sharding
○ Bin-packing with a custom partitioner
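A minimal salting sketch (hypothetical key/value column names): spread hot keys over random salt buckets, aggregate per salted key, then merge the partial results:
import org.apache.spark.sql.functions.{col, rand, sum}

val numSalts = 16
val partials = df
  .withColumn("salt", (rand() * numSalts).cast("int"))
  .groupBy(col("key"), col("salt"))
  .agg(sum(col("value")).as("partial")) // the heavy shuffle is spread across salts
val result = partials
  .groupBy(col("key"))
  .agg(sum(col("partial")).as("total")) // cheap second aggregation "unsalts" the keys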
76. Optimizing - Code
● Avoid JOINs
● Minimize the number of aggregations
● In general, avoid shuffles
● map vs mapPartitions
● groupByKey vs reduceByKey (see the sketch after this list)
● etc.
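For instance, for word counts (a sketch; words is assumed to be an RDD[String]):
val pairs = words.map((_, 1))
val fast = pairs.reduceByKey(_ + _) // combines map-side before the shuffle
val slow = pairs.groupByKey().mapValues(_.sum) // ships every value across the network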
77. Optimizing - Broadcast JOIN
● Prerequisite: one of the joined datasets is small
● spark.sql.autoBroadcastJoinThreshold (using Hive metadata)
● broadcast(dataset) as a hint
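A sketch (large, small and the join column are hypothetical) of hinting the join explicitly:
import org.apache.spark.sql.functions.broadcast

// `small` is shipped to every executor, so `large` is not shuffled
val joined = large.join(broadcast(small), Seq("id"))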
78. Optimizing - Tips
● Learn to use Spark UI
● Look for optimizations for your specific case in Google
● Use explain() / toDebugString()
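For example (df and rdd stand for any Dataframe/RDD under inspection):
df.explain() // physical plan only
df.explain(extended = true) // parsed, analyzed and optimized logical plans as well
println(rdd.toDebugString) // RDD lineage, indented at stage boundaries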
The purpose of the talk is not to teach theoretical material, but rather to have an interactive conversation about how to improve existing pipelines.
Some topics might be missing; I tried to collect everything that is important in my opinion. If something that interests you is missing, please tell me.
The lecture may not be well organized; this is the first time I'm presenting it (I'd love to get your feedback).
Spark is a distributed computing and data processing engine.
R is a programming language for statistical computing and graphics
We’ll concentrate on Scala from now on
Use class relocation (shading) to resolve dependency conflicts.
RDD abstraction enables Spark programming model
Number of partitions is determined by:
The source (in HDFS: number of partitions = number of blocks)
Default parallelism setting
Stages are separated by shuffles
If something can start without waiting for previous results, it can run in the same stage.
Transform operations run on the same Task
It’s a higher abstraction
Examples of hierarchical data: JSON, class objects, Protobuf messages, etc.
Constructing from RDD:
Inferring schema using reflection
Providing schema
A couple of useful concepts that are crucial for optimizing Spark workloads
Sometimes, saving intermediate results to S3 and loading them later is faster than using Spark's caching mechanism.
Checkpointing is extremely helpful in streaming applications, when the previous dataset is unioned with the current one. Without it, the RDD graph will grow endlessly.
localCheckpoint trades fault tolerance for performance
This is a regular value, not rdd or dataframe
The Spark driver is the coordinator of the application.
The driver and executors are separate JVM processes.
A task handles one partition.
There's an option to specify how many CPU cores to use per task.
Having multiple tasks running in the same executor can help them share memory.
A standalone application starts and instantiates a SparkContext instance (only then can you call the application a driver).
The driver program asks the cluster manager for resources to launch executors.
The cluster manager launches executors.
The driver process runs through the user application; depending on the actions and transformations over RDDs, tasks are sent to executors.
Executors run the tasks and save the results.
If any worker crashes, its tasks will be sent to different executors to be processed again.
With SparkContext.stop() from the driver, or if the main method exits/crashes, all the executors will be terminated and the cluster resources will be released by the cluster manager.
Resolving the logical plan means resolving references such as views, etc.
Logical optimization example: when joining two datasets on a date and only one of them has a date-range filter, the filter can be applied to the other as well; another example is removing unused columns.
The logical plan describes the dataset computation without defining how to carry it out.
There are two levels of configuration: YARN and Spark
The configuration is complex. Cloudera packages make it simpler.
Automatic configuration is applied based on selected instance types
Dynamic configuration is enabled automatically
Good for single applications running on their own cluster
Good for multitenant clusters
Not really a streaming engine.
Advantages: reuses existing infra, rich API, straightforward windowing, easier to implement "exactly once".
Disadvantages: latency.
The receiver runs as a task.
A block is defined by the block interval.
The block manager is responsible for storing and replicating blocks.
Blocks can be written to a WAL to provide fault tolerance.
RDD is calculated as part of every micro-batch job
In the case of Kafka:
The driver computes the offsets to fetch from.
Every RDD is calculated based on start/end offsets.
The RDD is a KafkaRDD, which is responsible for retrieving data per partition.
First appeared in Spark 2.0
Complete - dumps everything to the output sink
Append - only writes new rows to the output sink
Update - writes all new and updated rows to the output sink
First appeared in Spark 2.1
This model was first seen in Google Dataflow, which was later contributed to Apache Beam.
Instead of using Marathon, multiple instances of scripts that acquire a lock in Consul can be used.
A flame graph is a modern way of call-time profiling.
Spark UI can help pinpoint things like data skews, misconfiguration, wrongly written code, etc.
The GC log must be enabled if you plan to tune GC (and you should).
An explain query shows the physical plan.
Spark metrics contain pretty much everything you'll find in the Spark UI.
There's no universal rule of thumb; tuning always depends on the workload.
Spark has a great documentation website.
The number of partitions affects how the Spark cluster is utilized. There should be 2-3 tasks running per core.
Control the number of output files (there's a new coalesce that takes the desired file size as input).
spark.sql.shuffle.partitions helps control shuffle parallelism (default: 200). This also controls shuffle block size (whose limit is 2G).
Rule of thumb: 128MB per partition.
If the number of partitions is close to 2000, bump it to 2001 (Spark uses a different data structure for bookkeeping during shuffles).
GC must be tuned according to the workload.
We used CMS and G1.
Some workloads require a bigger young generation.
Stronger EC2 instances can solve disk/network issues.
Verify that enhanced networking is enabled in the kernel (if it's supported by the instance type).
The slide is taken from Amazon AWS slides
Hugepages (HP) were created to solve the small memory block size issue. THP aimed to ease the process of configuring HP, but the feature doesn't work properly: memory becomes fragmented.
Swappiness defines the swapping frequency: 0 disables it completely, 1 means emergency-only swapping (prevents processes from being killed).
Kryo serialization is faster and more compact, and can speed things up by 10%-20% depending on the workload.
Columnar formats like Parquet allow reading only the required columns. Avro might be better if you always read entire rows (all columns).
Different compression algorithms give different compression rates at the price of performance. Bottom line: gzip compression is quite good.
Salting: adding a random element to partition keys. How-to: salt keys -> do the operation on salted keys -> unsalt keys.
Progressive sharding: an iterative process of detecting shard sizes, then avoiding skewed shards in each iteration.
Use windowing functions to minimize the number of aggregations (and to avoid joins).
groupByKey first groups (shuffles all values) and only then reduces; reduceByKey works the other way around, reducing map-side first, thus minimizing the shuffled data set.
There's plenty of information on how to optimize Spark for a specific use case.
For example, speeding up data access on S3, etc.